R Tutorial at the WZB

1. Load the file “mdb_twitter.csv” included in the resources file “11collecting-resources.zip” from the tutorial website into R.

Remember that you can load CSV files with read.csv(). Make sure not to read in strings as factors, and read the postal code (variable personal.location.postal_code) as a character string, not as a numeric variable. Hint: Have a look at the help for read.csv() and look for the arguments stringsAsFactors and colClasses.
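The effect of colClasses is easiest to see on a postal code with a leading zero. The following is a minimal sketch with made-up inline data; the column names name and postal_code are illustrative and not taken from mdb_twitter.csv:

```r
# Made-up CSV content; "01067" has a leading zero
csv_text <- "name,postal_code\nA,01067\nB,10115"

# Default type guessing reads the postal code as an integer and drops the zero
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(default_read$postal_code)

# Forcing the column to character keeps the code exactly as written
char_read <- read.csv(text = csv_text,
                      stringsAsFactors = FALSE,
                      colClasses = c(postal_code = "character"))
char_read$postal_code
```

Columns that are not named in colClasses are still guessed as usual, so only the postal code needs to be listed.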

This is a data subset on members of the 19th (2017 to 2021 max.) German Bundestag (ger. “Mitglieder des Bundestags – MdB”) that was fetched from abgeordnetenwatch.de.

Familiarize yourself with the data. There is no codebook for the data, which is bad practice, but for the sake of this exercise it should be enough to infer the meaning of the variables from the column names and their values. Also check the number of observations.

library(tidyverse)   # for dplyr and ggplot2

mdb <- read.csv('11collecting-resources/mdb_twitter.csv',
                stringsAsFactors = FALSE,
                colClasses = c(personal.location.postal_code = "character"))
head(mdb)

##   meta.status      meta.edited                            meta.uuid
## 1           1 2017-11-17 12:45 c2d05a34-99a4-4a37-9b49-4f7b118d7f38
## 2           1 2017-11-21 15:37 04b9aad2-cf6c-4411-a27f-9fda675c5d3d
## 3           1 2017-11-21 16:46 aa9a5c1f-ab0e-45e9-ab61-becd66f3660f
## 4           1 2017-11-24 12:25 c0aaa650-de62-41a9-9ace-6cf337be0202
## 5           1 2017-12-12 08:53 22bdef04-58be-4c1c-a795-448f7db42ffe
##          meta.username meta.questions meta.answers meta.standard_replies
## 1          detlef-seif              1            0                     1
## 2 klaus-dieter-grohler              4            1                     0
## 3           dirk-wiese              0            0                     0
## 4           gabi-weber              2            0                     0
## 5           dirk-vopel              0            0                     0
##                                                        meta.url
## 1          https://www.abgeordnetenwatch.de/profile/detlef-seif
## 2 https://www.abgeordnetenwatch.de/profile/klaus-dieter-grohler
## 3           https://www.abgeordnetenwatch.de/profile/dirk-wiese
## 4           https://www.abgeordnetenwatch.de/profile/gabi-weber
## 5           https://www.abgeordnetenwatch.de/profile/dirk-vopel
##   personal.degree personal.first_name personal.last_name personal.gender
## 1            <NA>              Detlef               Seif            <NA>
## 2            <NA>        Klaus-Dieter            Gröhler            male
## 3            <NA>                Dirk              Wiese            male
## 4            <NA>                Gabi              Weber          female
## 5            <NA>                Dirk              Vöpel            male
##   personal.birthyear                personal.education personal.profession
## 1               1962  Studium der Rechtswissenschaften   MdB, Rechtsanwalt
## 2               1966  Studium der Rechtswissenschaften                 MdB
## 3               1983                            Jurist                 MdB
## 4               1955                    Keramikmalerin                 MdB
## 5               1971 selbstständiger IT-Systembetreuer                 MdB
##   personal.location.country personal.location.state personal.location.city
## 1                        DE     Nordrhein-Westfalen            Weilerswist
## 2                        DE                  Berlin                 Berlin
## 3                        DE     Nordrhein-Westfalen                 Brilon
## 4                        DE         Rheinland-Pfalz                 Wirges
## 5                        DE     Nordrhein-Westfalen             Oberhausen
##   personal.location.postal_code
## 1                              
## 2                              
## 3                              
## 4                              
## 5                              
##                                                                             personal.picture.url
## 1           https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/detlef_seif_68.jpg
## 2 https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/klaus_dieter_groehler_32.jpg
## 3         https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/users/kampagne17.jpg
## 4         https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/weber_gabi_klein.jpg
## 5           https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/dirk_voepel_14.jpg
##   personal.picture.copyright     party parliament.name
## 1                                  CDU       Bundestag
## 2                                  CDU       Bundestag
## 3                                  SPD       Bundestag
## 4                                  SPD       Bundestag
## 5                                  SPD       Bundestag
##                        parliament.uuid parliament.joined twitter_name
## 1 60d0787f-e311-4283-a7fd-85b9f62a9b33              <NA>         <NA>
## 2 60d0787f-e311-4283-a7fd-85b9f62a9b33              <NA>         <NA>
## 3 60d0787f-e311-4283-a7fd-85b9f62a9b33              <NA>   dirkwiese4
## 4 60d0787f-e311-4283-a7fd-85b9f62a9b33              <NA> gabiweberspd
## 5 60d0787f-e311-4283-a7fd-85b9f62a9b33              <NA>         <NA>
##  [ reached getOption("max.print") -- omitted 1 row ]

dim(mdb)

## [1] 708  26

The data has 26 variables on 708 members (Wikipedia says there are 709 members in the Bundestag, but let’s not be pedantic for this exercise).

2. Create a subset of the data with the following criteria:

select only the variables personal.first_name, personal.last_name, personal.gender, personal.birthyear, party, twitter_name
do not include members without party affiliation (“fraktionslos”)

After that, have a look at where we have missing data in the data set that might later cause trouble.

mdb <- mdb %>%
  select(personal.first_name, personal.last_name, personal.gender,
         personal.birthyear, party, twitter_name) %>%
  filter(party != 'fraktionslos')
head(mdb)

##   personal.first_name personal.last_name personal.gender
## 1              Detlef               Seif            <NA>
## 2        Klaus-Dieter            Gröhler            male
## 3                Dirk              Wiese            male
## 4                Gabi              Weber          female
## 5                Dirk              Vöpel            male
## 6             Kersten            Steinke            <NA>
##   personal.birthyear     party twitter_name
## 1               1962       CDU         <NA>
## 2               1966       CDU         <NA>
## 3               1983       SPD   dirkwiese4
## 4               1955       SPD gabiweberspd
## 5               1971       SPD         <NA>
## 6               1958 DIE LINKE         <NA>

Three “fraktionslos” members were removed from the data set:

nrow(mdb)

## [1] 705

sum(is.na(mdb$personal.gender))

## [1] 176
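To see missing values in all columns at once rather than one variable at a time, is.na() can be applied to the whole data frame. A small sketch on a made-up data frame (toy_mdb is hypothetical and only mimics three columns of mdb):

```r
# Hypothetical toy data that mimics three columns of mdb
toy_mdb <- data.frame(
  personal.gender = c(NA, "male", "female", NA),
  party = c("CDU", "SPD", "SPD", "DIE LINKE"),
  twitter_name = c(NA, NA, "gabiweberspd", NA),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix;
# column sums give the number of NAs per variable
na_counts <- colSums(is.na(toy_mdb))
na_counts

# column means give the share of NAs per variable
na_share <- colMeans(is.na(toy_mdb))
na_share
```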

3. There are quite a lot of NAs in the `personal.gender` variable (I introduced them for the sake of this exercise). The gender of about a quarter of MPs is missing. Let’s try to impute the gender.

3.1. First, think how you could do imputation on that variable and what the pros and cons of each method are.

We could do manual imputation either by only looking at the name (which could be inaccurate) or by checking the profiles / websites / etc. of each member, which would be the most accurate but also the most laborious way. We could also do it automatically, for example using the genderizeR package.

3.2. Then, inform yourself about the genderizeR package and install it.

Please note that the package uses the genderize.io API to predict the gender of a person using a first name. There is a rate limit of 1000 requests (i.e. 1000 names) per day.
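Because each distinct name costs one request, it helps to count the unique names before querying. A sketch with a made-up name vector; the limit of 1000 is the one quoted above:

```r
# Daily request limit as stated in the text
daily_limit <- 1000

# Made-up first names with duplicates
names_to_query <- c("Peter", "Marco", "Peter", "Claudia", "Marco")

# Duplicates would only waste requests, so query each name once
unique_names <- unique(names_to_query)
length(names_to_query)
length(unique_names)

# TRUE if all names fit into one day's quota
within_limit <- length(unique_names) <= daily_limit
within_limit
```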

The package is a bit awkward to work with, I have to say. An alternative would be to use the web API directly from R without the package, which would also allow us to use additional features such as specifying the country the name comes from (this increases accuracy). That would go too far for this exercise, though, so let’s use the package.
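For orientation only, here is a rough sketch of what a direct request could look like. The helper build_genderize_url() is hypothetical, and the endpoint and the query parameters name and country_id are assumptions based on the public genderize.io documentation, so check them there before relying on this. The sketch only builds the URL and does not send a request:

```r
# Build a request URL for the genderize.io web API.
# Assumption: endpoint and the "name" / "country_id" query parameters are
# as described in the public genderize.io documentation.
build_genderize_url <- function(name, country_id = NULL) {
  url <- paste0("https://api.genderize.io/?name=",
                utils::URLencode(name, reserved = TRUE))
  if (!is.null(country_id)) {
    url <- paste0(url, "&country_id=",
                  utils::URLencode(country_id, reserved = TRUE))
  }
  url
}

request_url <- build_genderize_url("Paula", country_id = "DE")
request_url

# Not run here, because it counts against the daily rate limit:
# the JSON response could be fetched and parsed with, e.g., jsonlite
# result <- jsonlite::fromJSON(request_url)
```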

To predict gender from a set of names with this package, you need to use two functions: 1. findGivenNames() and 2. genderize().

findGivenNames() extracts a first name from a text (in our case, that’s not really necessary because we already have the first name in a separate data frame column), does some text preparation (like lower case transformation), queries the web service of genderize.io and returns the result in a data frame. Note that it also includes an estimate of the correct prediction probability:

library(genderizeR)

testnames <- c('Paula', 'Paul')
test_genderinfo <- findGivenNames(testnames, progress = FALSE)

test_genderinfo

##     name gender probability count
## 1: paula female        0.99  2298
## 2:  paul   male           1  5931

Now we need to pass this result to genderize(). This is basically only necessary to combine the original names in testnames with the test_genderinfo output.

genderize(testnames, test_genderinfo, progress = FALSE)

##     text givenName gender genderIndicators
## 1: Paula     paula female                1
## 2:  Paul      paul   male                1

We can later use the original name in the text column for joining.

3.3. Try out the genderizeR package as shown above with a few names.

3.4. Next, select all first names from the data set of Bundestag members, for which we do not have gender information. Make sure to remove all duplicate names. Then, use the findGivenNames() function to let the web service predict the gender from the supplied names. Inspect the result data frame.

firstnames <- filter(mdb, is.na(personal.gender)) %>% pull(personal.first_name) %>% unique()
genderinfo <- findGivenNames(firstnames, progress = FALSE)

genderinfo

##          name gender probability count
##   1:   detlef   male           1    10
##   2:  kersten female           1     3
##   3:   detlev   male           1     2
##   4:    peter   male           1  4373
##   5:    marco   male        0.99  2493
##  ---                                  
## 125:    björn   male           1   194
## 126: dorothee female           1    36
## 127:  claudia female           1  3051
## 128:     omid   male           1    41
## 129:   volker   male           1    31

3.5. Have a look at the distribution of the probability variable in the result. Filter out results with a probability less than 0.9 and investigate them.

qplot(genderinfo$probability)

filter(genderinfo, probability < 0.9)

##       name gender probability count
## 1   lothar   male        0.89     9
## 2      jan   male         0.6  1663
## 3   gerold   male         0.8     5
## 4   simone female        0.55  1086
## 5 gabriele   male        0.81   250
## 6      kai   male        0.87   207

Although the estimated probability of a correct prediction is low for certain names, the predictions still seem reasonable, so we don’t need to filter out any of them.
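One thing to watch: the printed values (1 rather than 1.00) hint that the probability column may be stored as text rather than as a number. That is an assumption from the output, not something the package documentation is quoted on here. Converting explicitly before comparing is a cheap safeguard. A sketch on a hypothetical toy data frame with some of the values shown above:

```r
# Hypothetical toy version of the result, with probability stored as text
genderinfo_toy <- data.frame(
  name = c("lothar", "jan", "peter", "marco"),
  gender = c("male", "male", "male", "male"),
  probability = c("0.89", "0.6", "1", "0.99"),
  stringsAsFactors = FALSE
)

# Convert to numeric so that "<" compares numbers, not strings
genderinfo_toy$probability <- as.numeric(genderinfo_toy$probability)

# Keep only the predictions below the 0.9 threshold
low_confidence <- genderinfo_toy[genderinfo_toy$probability < 0.9, ]
low_confidence$name
```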

3.6. Then, use genderize() to get the complete data frame which also contains the original first names in the text variable. You should then have a data frame with the original first name and a predicted gender. Use this data frame now for imputation of the data set of Bundestag members. Check if there are still NAs in the personal.gender variable. If yes, you can manually set a gender if you’re confident or just ignore it.

Hint: A strategy here is to use a join (as we learned in “data linkage”) where the variables personal.first_name from the members data and text from the predicted gender data must match. You can then replace NA values in personal.gender for example by using mutate() and ifelse() (there are also other ways).

Getting the full gender information data frame:

genderinfo_complete <- genderize(firstnames, genderinfo, progress = FALSE)

genderinfo_complete

##          text givenName gender genderIndicators
##   1:   Detlef    detlef   male                1
##   2:  Kersten   kersten female                1
##   3:   Detlev    detlev   male                1
##   4:    Peter     peter   male                1
##   5:    Marco     marco   male                1
##  ---                                           
## 128:    Björn     björn   male                1
## 129: Dorothee  dorothee female                1
## 130:  Claudia   claudia female                1
## 131:     Omid      omid   male                1
## 132:   Volker    volker   male                1

Joining it with the Bundestag members data and replacing NAs:

mdb <- left_join(mdb, genderinfo_complete, by = c('personal.first_name' = 'text')) %>%
  # next we replace all NAs in personal.gender from mdb with predicted gender from genderinfo_complete
  # so "gender" from genderinfo_complete will be used when the
  # condition "is.na(personal.gender) & !is.na(gender)" is TRUE
  # otherwise keep the original value from "personal.gender"
  mutate(personal.gender = ifelse(is.na(personal.gender) & !is.na(gender), gender, personal.gender)) %>%
  select(-c(givenName, gender, genderIndicators))  # we don't need these columns any more
mdb

##     personal.first_name  personal.last_name personal.gender
## 1                Detlef                Seif            male
## 2          Klaus-Dieter             Gröhler            male
## 3                  Dirk               Wiese            male
## 4                  Gabi               Weber          female
## 5                  Dirk               Vöpel            male
## 6               Kersten             Steinke          female
## 7                Ursula             Schulte          female
## 8                  Axel             Schäfer            male
## 9          Ernst Dieter            Rossmann            male
## 10               Susann            Rüthrich          female
## 11             Johannes              Röring            male
## 12                 René              Röspel            male
## 13                 Swen              Schulz            male
## 14               Thomas              Rachel            male
## 15                Alois              Rainer            male
## 16                Achim                Post            male
## 17               Detlev              Pilger            male
## 18              Henning                Otte            male
## 19              Andreas                Nick            male
## 20            Friedrich          Ostendorff            male
## 21              Susanne              Mittag          female
## 22               Sabine              Leidig          female
## 23            Katharina            Landgraf          female
## 24                  Roy               Kühne            male
## 25            Christian                Kühn            male
##     personal.birthyear      party    twitter_name
## 1                 1962        CDU            <NA>
## 2                 1966        CDU            <NA>
## 3                 1983        SPD      dirkwiese4
## 4                 1955        SPD    gabiweberspd
## 5                 1971        SPD            <NA>
## 6                 1958  DIE LINKE            <NA>
## 7                 1952        SPD            <NA>
## 8                 1952        SPD            <NA>
## 9                 1951        SPD      edrossmann
## 10                1977        SPD susannruethrich
## 11                1959        CDU            <NA>
## 12                1964        SPD            <NA>
## 13                1968        SPD      swenschulz
## 14                1962        CDU            <NA>
## 15                1965        CSU            <NA>
## 16                1959        SPD         achim_p
## 17                1955        SPD    detlevpilger
## 18                1968        CDU            <NA>
## 19                1967        CDU   drandreasnick
## 20                1953 DIE GRÜNEN     fostendorff
## 21                1958        SPD            <NA>
## 22                1961  DIE LINKE    sabineleidig
## 23                1954        CDU            <NA>
## 24                1967        CDU   dr_roy_kuehne
## 25                1979 DIE GRÜNEN  chriskuehn_mdb
##  [ reached getOption("max.print") -- omitted 680 rows ]
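As one of the “other ways” mentioned in the hint, the NA replacement can also be written with dplyr’s coalesce(), which takes the first non-missing value per row. A minimal sketch on made-up data (members, predicted and the column predicted_gender are hypothetical names):

```r
library(dplyr)

# Made-up members data with missing gender values
members <- data.frame(
  first_name = c("Detlef", "Gabi", "Siemtje"),
  gender = c(NA, "female", NA),
  stringsAsFactors = FALSE
)

# Made-up prediction table with one row per first name
predicted <- data.frame(
  text = c("Detlef"),
  predicted_gender = c("male"),
  stringsAsFactors = FALSE
)

imputed <- left_join(members, predicted, by = c("first_name" = "text")) %>%
  # keep the original gender where present, otherwise use the prediction
  mutate(gender = coalesce(gender, predicted_gender)) %>%
  select(-predicted_gender)
imputed
```

Names without a prediction simply stay NA. The lookup table should contain each name only once, because duplicate keys on the right-hand side of a left join duplicate the matching rows on the left.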

Check number of NAs:

sum(is.na(mdb$personal.gender))

## [1] 1

Check who that is:

filter(mdb, is.na(personal.gender))

##   personal.first_name personal.last_name personal.gender
## 1             Siemtje             Möller            <NA>
##   personal.birthyear party twitter_name
## 1               1983   SPD         <NA>

After some manual research, do a manual imputation:

mdb[is.na(mdb$personal.gender),]$personal.gender <- 'female'
sum(is.na(mdb$personal.gender))

## [1] 0

3.7. Make a final data frame of the Bundestag members by converting personal.gender and party to factors (makes further analysis easier).

mdb <- mutate(mdb, personal.gender = as.factor(personal.gender), party = as.factor(party))

4. Investigate the ratio of Twitter users per party (using tabular output and/or a plot). Finally, subset your data to only include Bundestag members that are Twitter users.

ratio_twitter_party <- group_by(mdb, party) %>%
  summarise(n_members = n(),
            n_twitter = sum(!is.na(twitter_name)),
            ratio_twitter = n_twitter / n_members)
ratio_twitter_party

## # A tibble: 7 x 4
##   party      n_members n_twitter ratio_twitter
##   <fct>          <int>     <int>         <dbl>
## 1 AfD               92        54         0.587
## 2 CDU              199        73         0.367
## 3 CSU               46        17         0.370
## 4 DIE GRÜNEN        67        56         0.836
## 5 DIE LINKE         69        49         0.710
## 6 FDP               80        46         0.575
## 7 SPD              152        99         0.651

ggplot(ratio_twitter_party, aes(x = party, y = ratio_twitter)) + geom_col()
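The ratio is simply the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. This gives a compact way to cross-check such a table. A base R sketch on made-up data (toy_members is hypothetical):

```r
# Made-up data: two parties, some members without a Twitter name
toy_members <- data.frame(
  party = c("SPD", "SPD", "CDU", "CDU", "CDU"),
  twitter_name = c("a", NA, NA, NA, "b"),
  stringsAsFactors = FALSE
)

# Share of members with a Twitter name per party
ratio <- tapply(!is.na(toy_members$twitter_name), toy_members$party, mean)

# Highest ratio first
sort(ratio, decreasing = TRUE)
```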

Subset with Twitter users:

mdb_twitter <- filter(mdb, !is.na(twitter_name))
nrow(mdb_twitter)

## [1] 394

5. Fetch user data for each Twitter user in the Bundestag members data set from the Twitter API using `rtweet`.

From now on, work only with the subset that you created previously (Bundestag members with Twitter account), because we want to fetch some metadata from Twitter. If you want to do that task, you will need to install the package rtweet and load it. You also need a Twitter account (you don’t have to tweet anything, though!) and need to set up an “App” at developer.twitter.com. You can then find the four authentication keys that you need for accessing the Twitter API in the “Keys and tokens” tab of the app that you created (as shown in the tutorial).
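The solution below loads the keys with source('twitterkeys.R') so that they do not appear in the script itself. The content of that file is not shown in this document. One possible layout, given here purely as an assumption, reads the keys from environment variables; the variable names TWITTER_CONSUMER_KEY etc. are made up for this sketch:

```r
# Hypothetical contents of twitterkeys.R (keep this file out of version control)

# Names of the environment variables are assumptions for this sketch
key_names <- c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
               "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

# Sys.getenv() returns "" for variables that are not set
keys <- Sys.getenv(key_names)

# Warn early instead of failing later during authentication
missing_keys <- names(keys)[keys == ""]
if (length(missing_keys) > 0) {
  warning("Not set: ", paste(missing_keys, collapse = ", "))
}

consumer_key    <- keys[["TWITTER_CONSUMER_KEY"]]
consumer_secret <- keys[["TWITTER_CONSUMER_SECRET"]]
access_token    <- keys[["TWITTER_ACCESS_TOKEN"]]
access_secret   <- keys[["TWITTER_ACCESS_SECRET"]]
```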

5.1. After that, use the keys to create an access token with create_token(). In order to check if it works, look up data for a single Twitter account, e.g. “WZB_Berlin” using the function lookup_users(). Investigate the result in terms of which data is returned. What could be interesting for us?

library(rtweet)

source('twitterkeys.R')

token <- create_token(
    app = "WZBAnalysis",
    consumer_key = consumer_key,
    consumer_secret = consumer_secret,
    access_token = access_token,
    access_secret = access_secret)

lookup_users('WZB_Berlin')

## # A tibble: 1 x 88
##   user_id status_id created_at          screen_name text  source
## * <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 459107… 10732269… 2018-12-13 14:43:32 WZB_Berlin  "@Pe… Tweet…
## # ... with 82 more variables: display_text_width <int>,
## #   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
## #   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, hashtags <list>,
## #   symbols <list>, urls_url <list>, urls_t.co <list>,
## #   urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## #   media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## #   ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>,
## #   mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## #   quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## #   quoted_favorite_count <int>, quoted_retweet_count <int>,
## #   quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## #   quoted_followers_count <int>, quoted_friends_count <int>,
## #   quoted_statuses_count <int>, quoted_location <chr>,
## #   quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>,
## #   retweet_created_at <dttm>, retweet_source <chr>,
## #   retweet_favorite_count <int>, retweet_retweet_count <int>,
## #   retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>,
## #   country <chr>, country_code <chr>, geo_coords <list>,
## #   coords_coords <list>, bbox_coords <list>, status_url <chr>,
## #   name <chr>, location <chr>, description <chr>, url <chr>,
## #   protected <lgl>, followers_count <int>, friends_count <int>,
## #   listed_count <int>, statuses_count <int>, favourites_count <int>,
## #   account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## #   profile_expanded_url <chr>, account_lang <chr>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

A data frame with 88 variables is returned. It also includes the latest tweet of this account and some metadata for that tweet. Particularly interesting for us are variables like followers_count, friends_count (the number of accounts this user follows), statuses_count (the number of tweets sent since account creation) and account_created_at (the date the account was created).

5.2. Now look up user data for all Twitter account names in the Bundestag members data set (note that this will take several seconds) and store the result.

userdata <- lookup_users(mdb_twitter$twitter_name)
userdata

## # A tibble: 378 x 88
##    user_id status_id created_at          screen_name text  source
##  * <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
##  1 815469… 10735648… 2018-12-14 13:06:32 DirkWiese4  Inte… Twitt…
##  2 992808… 10729332… 2018-12-12 19:16:28 GabiWeberS… Blic… Faceb…
##  3 386459… 10731567… 2018-12-13 10:04:47 edrossmann  @Yan… Twitt…
##  4 146199… 46706835… 2014-05-15 22:25:47 SusannRuet… mit … Twitt…
##  5 217916… 10732029… 2018-12-13 13:08:26 swenschulz  "Sie… Twitt…
##  6 111774… 10732633… 2018-12-13 17:08:17 Achim_P     Im I… Twitt…
##  7 494500… 10735656… 2018-12-14 13:09:43 DrAndreasN… Pein… Twitt…
##  8 862714… 10731539… 2018-12-13 09:53:45 FOstendorff "...… Twitt…
##  9 257461… 10735546… 2018-12-14 12:26:00 SabineLeid… http… Hoots…
## 10 190691… 10716930… 2018-12-09 09:08:33 Dr_Roy_Kue… @Ilm… Twitt…
## # ... with 368 more rows, and 82 more variables: display_text_width <int>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, hashtags <list>,
## #   symbols <list>, urls_url <list>, urls_t.co <list>,
## #   urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## #   media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## #   ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>,
## #   mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## #   quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## #   quoted_favorite_count <int>, quoted_retweet_count <int>,
## #   quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## #   quoted_followers_count <int>, quoted_friends_count <int>,
## #   quoted_statuses_count <int>, quoted_location <chr>,
## #   quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>,
## #   retweet_created_at <dttm>, retweet_source <chr>,
## #   retweet_favorite_count <int>, retweet_retweet_count <int>,
## #   retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>,
## #   country <chr>, country_code <chr>, geo_coords <list>,
## #   coords_coords <list>, bbox_coords <list>, status_url <chr>,
## #   name <chr>, location <chr>, description <chr>, url <chr>,
## #   protected <lgl>, followers_count <int>, friends_count <int>,
## #   listed_count <int>, statuses_count <int>, favourites_count <int>,
## #   account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## #   profile_expanded_url <chr>, account_lang <chr>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

5.3. Create a subset of the Twitter user data with only the variables screen_name, account_created_at, followers_count, statuses_count. If you wanted to join this data with the Bundestag members data, which variable would you use to match the observations?

For joining, twitter_name from the Bundestag members data must match screen_name from the Twitter API data. Matching is case-sensitive, so both must be lower case (twitter_name already is).
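A quick illustration of why the case matters, using a few names that appear in the outputs above:

```r
# Names as stored in the Bundestag members data (already lower case)
twitter_name <- c("dirkwiese4", "achim_p", "detlevpilger")

# Names as returned by the Twitter API (mixed case)
screen_name <- c("DirkWiese4", "Achim_P")

# Exact matching is case-sensitive, so nothing matches
matched_raw <- twitter_name %in% screen_name
matched_raw

# After lower-casing, the two existing accounts are found
matched_lower <- twitter_name %in% tolower(screen_name)
matched_lower
```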

5.4. Transform the data in the subset using mutate() so that 1) the screen_name is always lower case; 2) you have a new variable days_since_creation containing the “age” of the account in days calculated with round(as.numeric(as.POSIXct('2018-12-13') - account_created_at)); 3) another new variable tweets_per_day that is the average number of tweets sent per day by this account (calculate it with the help of days_since_creation). We can then use tweets_per_day as a measure of activity on Twitter.

usersub <- select(userdata, screen_name, account_created_at, followers_count, statuses_count) %>%  # subset
  # create the new variables
  mutate(screen_name = tolower(screen_name),
         # 2018-12-13 is the date the data was collected
         days_since_creation = round(as.numeric(as.POSIXct('2018-12-13') - account_created_at)),
         tweets_per_day = statuses_count / days_since_creation)
usersub

## # A tibble: 378 x 6
##    screen_name account_created_at  followers_count statuses_count
##    <chr>       <dttm>                        <int>          <int>
##  1 dirkwiese4  2017-01-01 08:06:30            1643           3065
##  2 gabiwebers… 2012-12-06 10:29:09            3224           2776
##  3 edrossmann  2009-05-08 11:17:22            2360            590
##  4 susannruet… 2013-05-27 12:42:27             620             19
##  5 swenschulz  2009-02-24 20:14:10            3724           1136
##  6 achim_p     2013-01-24 21:12:00            3342           5082
##  7 drandreasn… 2009-06-21 22:26:05            3970          19374
##  8 fostendorff 2012-10-05 08:17:20            3148           1021
##  9 sabineleid… 2011-02-25 13:33:26            6970           6957
## 10 dr_roy_kue… 2013-09-26 06:13:48            1523            529
## # ... with 368 more rows, and 2 more variables: days_since_creation <dbl>,
## #   tweets_per_day <dbl>
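A note on the date arithmetic: subtracting two POSIXct values returns a difftime whose unit is chosen automatically from the smallest difference in the vector. Here all accounts are older than one day, so the unit is days and the calculation works. With a very young account in the data the unit would silently switch to hours. Using difftime() with an explicit unit avoids this. A sketch with one creation date from the output above and one made-up, very recent date; the time zone UTC is an assumption to keep the example reproducible:

```r
reference <- as.POSIXct("2018-12-13", tz = "UTC")

# First value is taken from the output above, second value is made up
created <- as.POSIXct(c("2017-01-01 08:06:30", "2018-12-12 20:00:00"), tz = "UTC")

# Automatic unit: the smallest difference is 4 hours, so everything is in hours
auto_diff <- reference - created
units(auto_diff)

# Explicit unit: always days, independent of the data
age_days <- as.numeric(difftime(reference, created, units = "days"))
round(age_days)
```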

5.5. Join the data of the Bundestag members with Twitter account and the data fetched from the Twitter API by matching twitter_name with screen_name. Check where the matching failed (i.e. because no data could be fetched for that account from Twitter).

mdb_twitterstats <- left_join(mdb_twitter, usersub, by = c("twitter_name" = "screen_name"))
mdb_twitterstats

##     personal.first_name  personal.last_name personal.gender
## 1                  Dirk               Wiese            male
## 2                  Gabi               Weber          female
## 3          Ernst Dieter            Rossmann            male
## 4                Susann            Rüthrich          female
## 5                  Swen              Schulz            male
## 6                 Achim                Post            male
## 7                Detlev              Pilger            male
## 8               Andreas                Nick            male
## 9             Friedrich          Ostendorff            male
## 10               Sabine              Leidig          female
## 11                  Roy               Kühne            male
## 12            Christian                Kühn            male
## 13                 Maik            Beermann            male
##     personal.birthyear      party    twitter_name  account_created_at
## 1                 1983        SPD      dirkwiese4 2017-01-01 08:06:30
## 2                 1955        SPD    gabiweberspd 2012-12-06 10:29:09
## 3                 1951        SPD      edrossmann 2009-05-08 11:17:22
## 4                 1977        SPD susannruethrich 2013-05-27 12:42:27
## 5                 1968        SPD      swenschulz 2009-02-24 20:14:10
## 6                 1959        SPD         achim_p 2013-01-24 21:12:00
## 7                 1955        SPD    detlevpilger                <NA>
## 8                 1967        CDU   drandreasnick 2009-06-21 22:26:05
## 9                 1953 DIE GRÜNEN     fostendorff 2012-10-05 08:17:20
## 10                1961  DIE LINKE    sabineleidig 2011-02-25 13:33:26
## 11                1967        CDU   dr_roy_kuehne 2013-09-26 06:13:48
## 12                1979 DIE GRÜNEN  chriskuehn_mdb 2013-02-02 18:49:01
## 13                1981        CDU    maikbeermann 2012-05-02 17:01:13
##     followers_count statuses_count days_since_creation tweets_per_day
## 1              1643           3065                 711   4.310830e+00
## 2              3224           2776                2198   1.262966e+00
## 3              2360            590                3505   1.683310e-01
## 4               620             19                2025   9.382716e-03
## 5              3724           1136                3578   3.174958e-01
## 6              3342           5082                2148   2.365922e+00
## 7                NA             NA                  NA             NA
## 8              3970          19374                3461   5.597804e+00
## 9              3148           1021                2260   4.517699e-01
## 10             6970           6957                2847   2.443625e+00
## 11             1523            529                1904   2.778361e-01
## 12             2273           1759                2139   8.223469e-01
## 13             1860           2466                2415   1.021118e+00
##  [ reached getOption("max.print") -- omitted 381 rows ]

Find out how often the matching failed:

sum(is.na(mdb_twitterstats$followers_count))

## [1] 16

And where it did:

filter(mdb_twitterstats, is.na(followers_count))

##    personal.first_name personal.last_name personal.gender
## 1               Detlev             Pilger            male
## 2              Andreas             Rimkus            male
## 3             Matthias            Büttner            male
## 4               Sandra             Weeser          female
## 5              Stephan          Protschka            male
## 6                Doris          Achelwilm          female
## 7                Bernd            Reuther            male
## 8              Manfred              Grund            male
## 9             Johannes            Schraps            male
## 10            Eckhardt            Rehberg            male
## 11               Gökay            Akbulut          female
## 12                René           Springer            male
## 13            Torbjörn             Kartes            male
##    personal.birthyear     party    twitter_name account_created_at
## 1                1955       SPD    detlevpilger               <NA>
## 2                1962       SPD   spduesseldorf               <NA>
## 3                1990       AfD     buettnersdl               <NA>
## 4                1969       FDP    weesersandra               <NA>
## 5                1977       AfD protschkastepha               <NA>
## 6                1976 DIE LINKE doris_achelwilm               <NA>
## 7                1971       FDP    reutherbernd               <NA>
## 8                1955       CDU   manfred_grund               <NA>
## 9                1983       SPD     jojoschraps               <NA>
## 10               1954       CDU eckhardtrehberg               <NA>
## 11               1982 DIE LINKE   goekayakbulut               <NA>
## 12               1979       AfD     springerren               <NA>
## 13               1979       CDU torbjoernkartes               <NA>
##    followers_count statuses_count days_since_creation tweets_per_day
## 1               NA             NA                  NA             NA
## 2               NA             NA                  NA             NA
## 3               NA             NA                  NA             NA
## 4               NA             NA                  NA             NA
## 5               NA             NA                  NA             NA
## 6               NA             NA                  NA             NA
## 7               NA             NA                  NA             NA
## 8               NA             NA                  NA             NA
## 9               NA             NA                  NA             NA
## 10              NA             NA                  NA             NA
## 11              NA             NA                  NA             NA
## 12              NA             NA                  NA             NA
## 13              NA             NA                  NA             NA
##  [ reached getOption("max.print") -- omitted 3 rows ]

These accounts may have been suspended or set to “private”. We exclude them:

mdb_twitterstats <- filter(mdb_twitterstats, !is.na(followers_count))
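Before dropping the unmatched rows it can be useful to keep their names, for example to check the accounts by hand later. The following is only a sketch under the assumption that the Twitter data was attached with a left join on the account name; the data frames `members` and `api_result` and their values are hypothetical stand-ins:

```r
library(dplyr)

# Hypothetical stand-ins for the MdB data and the data returned by the API.
members <- data.frame(twitter_name = c("alice", "bob", "carol"),
                      party = c("A", "B", "A"),
                      stringsAsFactors = FALSE)
api_result <- data.frame(twitter_name = c("alice", "carol"),
                         followers_count = c(1500, 320),
                         stringsAsFactors = FALSE)

# A left join keeps every member; members without an API match get NA.
joined <- left_join(members, api_result, by = "twitter_name")

# Store the unmatched names first, then keep only the matched rows.
unmatched <- filter(joined, is.na(followers_count))$twitter_name
matched <- filter(joined, !is.na(followers_count))

unmatched
```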

6. Now you can try to answer all sorts of questions you are interested in, for example:

Note: These are just example questions I came up with. You might have had different questions.

Who has the most followers (e.g. make a top 10 list)?

mdb_twitterstats %>% arrange(desc(followers_count)) %>% head(10)

##    personal.first_name personal.last_name personal.gender
## 1               Martin             Schulz            male
## 2                Sahra        Wagenknecht          female
## 3            Christian            Lindner            male
## 4                Heiko               Maas            male
## 5               Sigmar            Gabriel            male
## 6                Peter           Altmaier            male
## 7                Peter             Tauber            male
## 8               Katrin     Göring-Eckardt          female
## 9                  Cem            Özdemir            male
## 10              Thomas          Oppermann            male
##    personal.birthyear      party    twitter_name  account_created_at
## 1                1955        SPD    martinschulz 2008-11-27 10:49:00
## 2                1969  DIE LINKE    swagenknecht 2009-06-15 16:33:09
## 3                1979        FDP       c_lindner 2010-03-11 16:11:51
## 4                1966        SPD       heikomaas 2009-03-13 11:37:00
## 5                1959        SPD   sigmargabriel 2012-05-03 09:27:54
## 6                1958        CDU   peteraltmaier 2011-09-23 17:00:07
## 7                1974        CDU     petertauber 2009-02-28 14:36:49
## 8                1966 DIE GRÜNEN  goeringeckardt 2012-07-04 08:28:33
## 9                1965 DIE GRÜNEN    cem_oezdemir 2009-01-17 12:30:36
## 10               1954        SPD thomasoppermann 2011-05-04 18:06:59
##    followers_count statuses_count days_since_creation tweets_per_day
## 1           696782           4591                3668      1.2516358
## 2           393336           1290                3467      0.3720796
## 3           314876          12580                3198      3.9337086
## 4           299683           5140                3561      1.4434148
## 5           265029           3223                2415      1.3345756
## 6           240866          10546                2637      3.9992416
## 7           190058          22182                3574      6.2064913
## 8           136430          13585                2353      5.7734807
## 9           127532           8191                3616      2.2652102
## 10          127186           2943                2779      1.0590140
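The printed top 10 list wraps over several blocks because all columns are shown. One possible way to get a more compact list is to select only the columns of interest first. The sketch below uses made-up data and `slice_max()`, which assumes dplyr 1.0.0 or newer; with older versions, `arrange(desc(...)) %>% head()` as above does the same job:

```r
library(dplyr)

# Made-up accounts, only to illustrate the pattern.
toy <- data.frame(twitter_name = c("a", "b", "c", "d"),
                  party = c("X", "Y", "X", "Y"),
                  followers_count = c(50, 900, 300, 10),
                  statuses_count = c(5, 40, 70, 2),
                  stringsAsFactors = FALSE)

# Keep a few columns, then take the two rows with the most followers.
top2 <- toy %>%
  select(twitter_name, party, followers_count) %>%
  slice_max(followers_count, n = 2)

top2
```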

Who sends the most tweets per day (again, a top 10 list)?

mdb_twitterstats %>% arrange(desc(tweets_per_day)) %>% head(10)

##    personal.first_name personal.last_name personal.gender
## 1                  Udo         Hemmelgarn            male
## 2             Johannes              Kahrs            male
## 3                 Anke     Domscheit-Berg          female
## 4                 Jörg          Schneider            male
## 5               Saskia              Esken          female
## 6               Dieter            Janecek            male
## 7               Renate             Künast          female
## 8              Stephan           Brandner            male
## 9                  Uwe           Schummer            male
## 10              Nicola               Beer          female
##    personal.birthyear      party  twitter_name  account_created_at
## 1                1959        AfD udohemmelgarn 2017-04-04 00:44:40
## 2                1963        SPD         kahrs 2009-04-27 06:23:08
## 3                1968  DIE LINKE         anked 2008-10-02 10:12:46
## 4                1964        AfD schneider_afd 2016-04-12 19:49:25
## 5                1961        SPD   eskensaskia 2013-05-12 17:25:05
## 6                1976 DIE GRÜNEN      djanecek 2009-06-29 11:02:16
## 7                1955 DIE GRÜNEN renatekuenast 2013-05-13 11:24:44
## 8                1966        AfD    stbrandner 2016-03-25 13:46:18
## 9                1957        CDU   uweschummer 2011-11-10 12:18:40
## 10               1970        FDP nicolabeerfdp 2017-02-06 15:18:51
##    followers_count statuses_count days_since_creation tweets_per_day
## 1             4815          27717                 618      44.849515
## 2            20037          97517                3517      27.727324
## 3            28500          85058                3724      22.840494
## 4             3171          14075                 974      14.450719
## 5             6608          23954                2040      11.742157
## 6             8663          38457                3453      11.137272
## 7            41425          21876                2039      10.728789
## 8             4617          10398                 992      10.481855
## 9             7802          26701                2589      10.313248
## 10            9744           6219                 674       9.227003

What does the distribution of follower counts look like?

With log10 scale on x:

qplot(followers_count, data = mdb_twitterstats, geom = 'density') + scale_x_log10()

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 1 rows containing non-finite values (stat_density).
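The two warnings are what ggplot2 prints when a value of 0 is log-transformed, so presumably one account has no followers at all. A small sketch with made-up numbers shows the underlying problem and two common ways around it (filtering out zeros, or shifting all values by one before taking the logarithm):

```r
# Made-up follower counts, including one account without followers.
followers <- c(0, 10, 1000)

log10(followers)  # the zero becomes -Inf and cannot be placed on the axis

# Option 1: only consider accounts with at least one follower.
positive_only <- followers[followers > 0]
log_positive <- log10(positive_only)

# Option 2: add 1 before the transformation so that zero maps to 0.
log_shifted <- log10(followers + 1)

log_positive
log_shifted
```

Which option is appropriate depends on the question; the shift slightly distorts small counts, while the filter silently drops observations.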

What are the mean and the median follower counts per party?

followercounts_party <- group_by(mdb_twitterstats, party) %>%
  summarise(mean_followers = mean(followers_count),
            median_followers = median(followers_count))
followercounts_party

## # A tibble: 7 x 3
##   party      mean_followers median_followers
##   <fct>               <dbl>            <dbl>
## 1 AfD                 4868.            2338.
## 2 CDU                 9858.            2102 
## 3 CSU                 9120.            1400 
## 4 DIE GRÜNEN         13253.            4889 
## 5 DIE LINKE          17242.            4613 
## 6 FDP                10335.            1507 
## 7 SPD                20488.            2951
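In every party the mean is several times larger than the median, which is what one would expect when a few very popular accounts dominate. A short sketch with made-up numbers illustrates how a single large value pulls the mean up while the median barely reacts:

```r
# Four ordinary accounts and one very popular account (made-up values).
followers <- c(1200, 1500, 2100, 2400, 700000)

mean_followers <- mean(followers)      # dominated by the single large value
median_followers <- median(followers)  # the middle value, robust to outliers

mean_followers
median_followers
```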

With log10 scale on y:

ggplot(mdb_twitterstats, aes(x = party, y = followers_count)) + geom_boxplot() + scale_y_log10()

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

How do follower counts differ between men and women in each party?

followercounts_gender <- group_by(mdb_twitterstats, party, personal.gender) %>%
  summarise(mean_followers = mean(followers_count), median_followers = median(followers_count))
followercounts_gender

## # A tibble: 14 x 4
## # Groups:   party [?]
##    party      personal.gender mean_followers median_followers
##    <fct>      <fct>                    <dbl>            <dbl>
##  1 AfD        female                  19701.            6902.
##  2 AfD        male                     2846.            2212.
##  3 CDU        female                   2621.            1899 
##  4 CDU        male                    11509.            2171 
##  5 CSU        female                  15108             1453 
##  6 CSU        male                     5854.            1400 
##  7 DIE GRÜNEN female                  10865.            4889 
##  8 DIE GRÜNEN male                    16437.            4904.
##  9 DIE LINKE  female                  27521.            5501 
## 10 DIE LINKE  male                     7391.            4088.
## 11 FDP        female                   2959.            1317 
## 12 FDP        male                    12793.            1576 
## 13 SPD        female                   4432.            2272 
## 14 SPD        male                    28775.            3185
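Some of these groups may contain only a handful of people, in which case a mean says little. A possible extension is to report the group size alongside the statistics. The sketch below uses made-up data and the `.groups` argument, which assumes dplyr 1.0.0 or newer:

```r
library(dplyr)

# Made-up data with unequal group sizes.
toy <- data.frame(party = c("A", "A", "A", "B", "B"),
                  gender = c("female", "male", "male", "female", "male"),
                  followers_count = c(20000, 1000, 3000, 500, 700),
                  stringsAsFactors = FALSE)

# n() counts the rows in each party/gender combination.
by_gender <- toy %>%
  group_by(party, gender) %>%
  summarise(n = n(),
            mean_followers = mean(followers_count),
            median_followers = median(followers_count),
            .groups = "drop")

by_gender
```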

ggplot(mdb_twitterstats, aes(x = personal.gender, y = followers_count)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_grid(~ party)

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

How is the number of tweets per day distributed, and how does it vary between the parties?

qplot(tweets_per_day, data = mdb_twitterstats, geom = 'density')

group_by(mdb_twitterstats, party) %>%
  summarise(mean_tweets_per_day = mean(tweets_per_day),
            median_tweets_per_day = median(tweets_per_day))

## # A tibble: 7 x 3
##   party      mean_tweets_per_day median_tweets_per_day
##   <fct>                    <dbl>                 <dbl>
## 1 AfD                      2.61                  0.771
## 2 CDU                      1.38                  0.447
## 3 CSU                      0.895                 0.370
## 4 DIE GRÜNEN               2.68                  2.19 
## 5 DIE LINKE                2.14                  1.35 
## 6 FDP                      1.49                  0.726
## 7 SPD                      1.66                  0.862

ggplot(mdb_twitterstats, aes(x = party, y = tweets_per_day)) + geom_boxplot() + ylim(0, 15)

## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
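The warning arises because `ylim(0, 15)` removes all observations outside the limits before the box plot statistics are computed; the top 10 list above contains exactly three accounts with more than 15 tweets per day. If the goal is only to zoom in, `coord_cartesian()` is an alternative that keeps all values in the computation. A sketch with made-up data:

```r
library(ggplot2)

# Made-up data with one extreme value per party.
toy <- data.frame(party = rep(c("A", "B"), each = 4),
                  tweets_per_day = c(0.5, 1, 2, 40, 0.2, 0.8, 3, 25))

# ylim() drops values outside the range before the statistics are computed.
p_dropped <- ggplot(toy, aes(x = party, y = tweets_per_day)) +
  geom_boxplot() +
  ylim(0, 15)

# coord_cartesian() only zooms the view; all values still enter the statistics.
p_zoomed <- ggplot(toy, aes(x = party, y = tweets_per_day)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 15))

# Number of observations that ylim() would remove in this toy example.
n_outside <- sum(toy$tweets_per_day > 15)
n_outside
```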

R Tutorial at the WZB

Tasks for 11 – Collecting data from the web

Markus Konrad

January 24, 2019

1. Load the file “mdb_twitter.csv” included in the resources file “11collecting-resources.zip” from the tutorial website into R.

2. Create a subset of the data with the following criteria:

3. There are quite a lot of NAs in the `personal.gender` variable (I introduced them for the sake of this exercise). The gender of about a quarter of MPs is missing. Let’s try to impute the gender.

4. Investigate the ratio of Twitter users per party (using tabular output and/or a plot). Finally, subset your data to only include Bundestag members that are Twitter users.

5. Fetch user data for each Twitter user in the Bundestag members data set from the Twitter API using `rtweet`.

6. Now you can try to answer all sorts of questions you are interested in, for example:
