Remember that you can load CSV files with read.csv(). Make sure to not read in strings as factors and to read the postal code (variable personal.location.postal_code) as character string, not as numeric variable. Hint: Have a look at the help for read.csv() and look for the arguments stringsAsFactors and colClasses.
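To see why colClasses matters here, the following minimal sketch reads a tiny CSV from a string. The column names and rows are invented for illustration and are not taken from the exercise file. Read with the defaults, the postal code becomes a number and loses its leading zero; read as character, it survives.

```r
# Illustrative two-row CSV; "01067" starts with a zero like many German postal codes
csv_text <- "city,postal_code\nDresden,01067\nBerlin,10115"

# Default type guessing: postal_code is read as an integer column
guessed <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Force postal_code to character, leave the other columns to type guessing
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                    colClasses = c(postal_code = "character"))

str(guessed$postal_code)  # integer values, leading zero gone
str(as_text$postal_code)  # character values, leading zero kept
```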
This is a data subset on members of the 19th (2017 to 2021 max.) German Bundestag (ger. “Mitglieder des Bundestags – MdB”) that was fetched from abgeordnetenwatch.de.
Familiarize yourself with the data. There is no codebook to the data, which is bad practice, but for the sake of this exercise it should be enough to assess the variables from the column names and their values. Also check the number of observations.
library(tidyverse) # for dplyr and ggplot2
mdb <- read.csv('11collecting-resources/mdb_twitter.csv',
stringsAsFactors = FALSE,
colClasses = c(personal.location.postal_code = "character"))
head(mdb)
## meta.status meta.edited meta.uuid
## 1 1 2017-11-17 12:45 c2d05a34-99a4-4a37-9b49-4f7b118d7f38
## 2 1 2017-11-21 15:37 04b9aad2-cf6c-4411-a27f-9fda675c5d3d
## 3 1 2017-11-21 16:46 aa9a5c1f-ab0e-45e9-ab61-becd66f3660f
## 4 1 2017-11-24 12:25 c0aaa650-de62-41a9-9ace-6cf337be0202
## 5 1 2017-12-12 08:53 22bdef04-58be-4c1c-a795-448f7db42ffe
## meta.username meta.questions meta.answers meta.standard_replies
## 1 detlef-seif 1 0 1
## 2 klaus-dieter-grohler 4 1 0
## 3 dirk-wiese 0 0 0
## 4 gabi-weber 2 0 0
## 5 dirk-vopel 0 0 0
## meta.url
## 1 https://www.abgeordnetenwatch.de/profile/detlef-seif
## 2 https://www.abgeordnetenwatch.de/profile/klaus-dieter-grohler
## 3 https://www.abgeordnetenwatch.de/profile/dirk-wiese
## 4 https://www.abgeordnetenwatch.de/profile/gabi-weber
## 5 https://www.abgeordnetenwatch.de/profile/dirk-vopel
## personal.degree personal.first_name personal.last_name personal.gender
## 1 <NA> Detlef Seif <NA>
## 2 <NA> Klaus-Dieter Gröhler male
## 3 <NA> Dirk Wiese male
## 4 <NA> Gabi Weber female
## 5 <NA> Dirk Vöpel male
## personal.birthyear personal.education personal.profession
## 1 1962 Studium der Rechtswissenschaften MdB, Rechtsanwalt
## 2 1966 Studium der Rechtswissenschaften MdB
## 3 1983 Jurist MdB
## 4 1955 Keramikmalerin MdB
## 5 1971 selbstständiger IT-Systembetreuer MdB
## personal.location.country personal.location.state personal.location.city
## 1 DE Nordrhein-Westfalen Weilerswist
## 2 DE Berlin Berlin
## 3 DE Nordrhein-Westfalen Brilon
## 4 DE Rheinland-Pfalz Wirges
## 5 DE Nordrhein-Westfalen Oberhausen
## personal.location.postal_code
## 1
## 2
## 3
## 4
## 5
## personal.picture.url
## 1 https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/detlef_seif_68.jpg
## 2 https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/klaus_dieter_groehler_32.jpg
## 3 https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/users/kampagne17.jpg
## 4 https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/weber_gabi_klein.jpg
## 5 https://www.abgeordnetenwatch.de/sites/abgeordnetenwatch.de/files/dirk_voepel_14.jpg
## personal.picture.copyright party parliament.name
## 1 CDU Bundestag
## 2 CDU Bundestag
## 3 SPD Bundestag
## 4 SPD Bundestag
## 5 SPD Bundestag
## parliament.uuid parliament.joined twitter_name
## 1 60d0787f-e311-4283-a7fd-85b9f62a9b33 <NA> <NA>
## 2 60d0787f-e311-4283-a7fd-85b9f62a9b33 <NA> <NA>
## 3 60d0787f-e311-4283-a7fd-85b9f62a9b33 <NA> dirkwiese4
## 4 60d0787f-e311-4283-a7fd-85b9f62a9b33 <NA> gabiweberspd
## 5 60d0787f-e311-4283-a7fd-85b9f62a9b33 <NA> <NA>
## [ reached getOption("max.print") -- omitted 1 row ]
dim(mdb)
## [1] 708 26
The data has 26 variables on 708 members (Wikipedia lists 709 members of the Bundestag, but let’s not be pedantic for this exercise).
Create a subset with only the variables personal.first_name, personal.last_name, personal.gender, personal.birthyear, party and twitter_name, and drop the members whose party is “fraktionslos”.
After that, have a look at where we have missing data in the data set that might later cause trouble.
mdb <- mdb %>% select(personal.first_name, personal.last_name, personal.gender, personal.birthyear, party, twitter_name) %>%
filter(party != 'fraktionslos')
head(mdb)
## personal.first_name personal.last_name personal.gender
## 1 Detlef Seif <NA>
## 2 Klaus-Dieter Gröhler male
## 3 Dirk Wiese male
## 4 Gabi Weber female
## 5 Dirk Vöpel male
## 6 Kersten Steinke <NA>
## personal.birthyear party twitter_name
## 1 1962 CDU <NA>
## 2 1966 CDU <NA>
## 3 1983 SPD dirkwiese4
## 4 1955 SPD gabiweberspd
## 5 1971 SPD <NA>
## 6 1958 DIE LINKE <NA>
Three “fraktionslos” members were removed from the data set:
nrow(mdb)
## [1] 705
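To look for missing data in all columns at once rather than one variable at a time, a small sketch on a made-up data frame (the three rows are invented for illustration): is.na() returns a logical matrix for a data frame, and colSums() counts the TRUE values per column.

```r
# Toy stand-in for the members data; values are made up
toy <- data.frame(first_name   = c("Detlef", "Gabi", "Dirk"),
                  gender       = c(NA, "female", "male"),
                  twitter_name = c(NA, "gabiweberspd", NA),
                  stringsAsFactors = FALSE)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts
```

Applied to the real data, this would be colSums(is.na(mdb)).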
sum(is.na(mdb$personal.gender))
## [1] 176
There are 176 missing values in the personal.gender variable (I introduced them for the sake of this exercise). The gender of about a quarter of MPs is missing. Let’s try to impute the gender.
3.1. First, think about how you could do imputation on that variable and what the pros and cons of each method are.
We could do manual imputation either by only looking at the name (which could be inaccurate) or by checking the profiles / websites / etc. of each member, which would be the most accurate but also the most laborious way. We could also do it automatically, for example using the genderizeR package.
3.2. Then, inform yourself about the genderizeR package and install it.
Please note that the package uses the genderize.io API to predict the gender of a person using a first name. There is a rate limit of 1000 requests (i.e. 1000 names) per day.
The package is a bit awkward to work with, I have to say. An alternative would be to use the web API directly from R without the package, which would also allow us to use additional features such as specifying the country the name comes from (that increases accuracy). Anyway, that would lead too far, so let’s use the package.
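For the curious, here is a rough sketch of what the direct route could look like. It assumes that the genderize.io endpoint accepts the query parameters name and country_id; please check this against the current API documentation before relying on it. The function names build_genderize_url() and query_genderize() are made up for this sketch, and query_genderize() additionally assumes that the jsonlite package is installed.

```r
# Build the request URL for one first name.
# country_id is assumed to be a two-letter country code such as "DE".
build_genderize_url <- function(first_name, country_id = "DE") {
  paste0("https://api.genderize.io/?name=",
         utils::URLencode(first_name, reserved = TRUE),
         "&country_id=", country_id)
}

# Hypothetical helper: sends the request and parses the JSON answer.
# Needs network access and counts against the daily rate limit,
# so it is only defined here, not called.
query_genderize <- function(first_name, country_id = "DE") {
  jsonlite::fromJSON(build_genderize_url(first_name, country_id))
}

build_genderize_url("peter")
```

Only the URL is built when this block runs; calling query_genderize("peter") would send an actual request.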
To predict gender from a set of names with this package, you need to use two functions: 1. findGivenNames() and 2. genderize().
findGivenNames() extracts a first name from a text (in our case, that’s not really necessary because we already have the first name in a separate data frame column), does some text preparation (like lower case transformation), queries the web service of genderize.io and returns the result in a data frame. Note that it also includes an estimate of the correct prediction probability:
library(genderizeR)
testnames <- c('Paula', 'Paul')
test_genderinfo <- findGivenNames(testnames, progress = FALSE)
test_genderinfo
## name gender probability count
## 1: paula female 0.99 2298
## 2: paul male 1 5931
Now we need to pass this result to genderize(). This is basically only necessary to combine the original names in testnames with the test_genderinfo output.
genderize(testnames, test_genderinfo, progress = FALSE)
## text givenName gender genderIndicators
## 1: Paula paula female 1
## 2: Paul paul male 1
We can later use the original name in the text column for joining.
3.3. Try out the genderizeR package as shown above with a few names.
3.4. Next, select all first names from the data set of Bundestag members for which we do not have gender information. Make sure to remove all duplicate names. Then, use the findGivenNames() function to let the web service predict the gender from the supplied names. Inspect the result data frame.
firstnames <- filter(mdb, is.na(personal.gender)) %>% pull(personal.first_name) %>% unique()
genderinfo <- findGivenNames(firstnames, progress = FALSE)
genderinfo
## name gender probability count
## 1: detlef male 1 10
## 2: kersten female 1 3
## 3: detlev male 1 2
## 4: peter male 1 4373
## 5: marco male 0.99 2493
## ---
## 125: björn male 1 194
## 126: dorothee female 1 36
## 127: claudia female 1 3051
## 128: omid male 1 41
## 129: volker male 1 31
3.5. Have a look at the distribution of the probability variable in the result. Filter out results with a probability less than 0.9 and investigate them.
qplot(genderinfo$probability)
filter(genderinfo, probability < 0.9)
## name gender probability count
## 1 lothar male 0.89 9
## 2 jan male 0.6 1663
## 3 gerold male 0.8 5
## 4 simone female 0.55 1086
## 5 gabriele male 0.81 250
## 6 kai male 0.87 207
Although the probability of a correct prediction is low for some names, the predictions still seem reasonable, so we don’t need to filter out any of them.
3.6. Then, use genderize() to get the complete data frame which also contains the original first names in the text variable. You should then have a data frame with the original first name and a predicted gender. Use this data frame now for imputation of the data set of Bundestag members. Check if there are still NAs in the personal.gender variable. If yes, you can manually set a gender if you’re confident or just ignore it.
Hint: A strategy here is to use a join (as we learned in “data linkage”) where the variables personal.first_name from the members data and text from the predicted gender data must match. You can then replace NA values in personal.gender for example by using mutate() and ifelse() (there are also other ways).
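As an illustration of one of those other ways, here is a base R sketch on made-up toy data (both data frames are invented stand-ins, not the real tables). Instead of a join, match() looks up the position of each first name in the prediction table, and ifelse() fills in a prediction only where the gender is missing.

```r
# Toy members data with missing genders (made-up rows)
members <- data.frame(first_name = c("Detlef", "Gabi", "Kersten", "Siemtje"),
                      gender     = c(NA, "female", NA, NA),
                      stringsAsFactors = FALSE)

# Toy prediction table; there is no prediction for "Siemtje"
predicted <- data.frame(text   = c("Detlef", "Kersten"),
                        gender = c("male", "female"),
                        stringsAsFactors = FALSE)

# Row of `predicted` that belongs to each member (NA if there is none)
idx <- match(members$first_name, predicted$text)

# Keep existing values, fill NAs with the prediction where one exists
members$gender <- ifelse(is.na(members$gender),
                         predicted$gender[idx],
                         members$gender)
members
```

Names without a prediction simply stay NA, which corresponds to the case that has to be handled manually.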
Getting the full gender information data frame:
genderinfo_complete <- genderize(firstnames, genderinfo, progress = FALSE)
genderinfo_complete
## text givenName gender genderIndicators
## 1: Detlef detlef male 1
## 2: Kersten kersten female 1
## 3: Detlev detlev male 1
## 4: Peter peter male 1
## 5: Marco marco male 1
## ---
## 128: Björn björn male 1
## 129: Dorothee dorothee female 1
## 130: Claudia claudia female 1
## 131: Omid omid male 1
## 132: Volker volker male 1
Joining it with the Bundestag members data and replacing NAs:
mdb <- left_join(mdb, genderinfo_complete, by = c('personal.first_name' = 'text')) %>%
# next we replace all NAs in personal.gender from mdb with predicted gender from genderinfo_complete
# so "gender" from genderinfo_complete will be used when the
# condition "is.na(personal.gender) & !is.na(gender)" is TRUE
# otherwise keep the original value from "personal.gender"
mutate(personal.gender = ifelse(is.na(personal.gender) & !is.na(gender), gender, personal.gender)) %>%
select(-c(givenName, gender, genderIndicators)) # we don't need these columns any more
mdb
## personal.first_name personal.last_name personal.gender
## 1 Detlef Seif male
## 2 Klaus-Dieter Gröhler male
## 3 Dirk Wiese male
## 4 Gabi Weber female
## 5 Dirk Vöpel male
## 6 Kersten Steinke female
## 7 Ursula Schulte female
## 8 Axel Schäfer male
## 9 Ernst Dieter Rossmann male
## 10 Susann Rüthrich female
## 11 Johannes Röring male
## 12 René Röspel male
## 13 Swen Schulz male
## 14 Thomas Rachel male
## 15 Alois Rainer male
## 16 Achim Post male
## 17 Detlev Pilger male
## 18 Henning Otte male
## 19 Andreas Nick male
## 20 Friedrich Ostendorff male
## 21 Susanne Mittag female
## 22 Sabine Leidig female
## 23 Katharina Landgraf female
## 24 Roy Kühne male
## 25 Christian Kühn male
## personal.birthyear party twitter_name
## 1 1962 CDU <NA>
## 2 1966 CDU <NA>
## 3 1983 SPD dirkwiese4
## 4 1955 SPD gabiweberspd
## 5 1971 SPD <NA>
## 6 1958 DIE LINKE <NA>
## 7 1952 SPD <NA>
## 8 1952 SPD <NA>
## 9 1951 SPD edrossmann
## 10 1977 SPD susannruethrich
## 11 1959 CDU <NA>
## 12 1964 SPD <NA>
## 13 1968 SPD swenschulz
## 14 1962 CDU <NA>
## 15 1965 CSU <NA>
## 16 1959 SPD achim_p
## 17 1955 SPD detlevpilger
## 18 1968 CDU <NA>
## 19 1967 CDU drandreasnick
## 20 1953 DIE GRÜNEN fostendorff
## 21 1958 SPD <NA>
## 22 1961 DIE LINKE sabineleidig
## 23 1954 CDU <NA>
## 24 1967 CDU dr_roy_kuehne
## 25 1979 DIE GRÜNEN chriskuehn_mdb
## [ reached getOption("max.print") -- omitted 680 rows ]
Check number of NAs:
sum(is.na(mdb$personal.gender))
## [1] 1
Check who’s that:
filter(mdb, is.na(personal.gender))
## personal.first_name personal.last_name personal.gender
## 1 Siemtje Möller <NA>
## personal.birthyear party twitter_name
## 1 1983 SPD <NA>
After some manual research, do a manual imputation:
mdb[is.na(mdb$personal.gender),]$personal.gender <- 'female'
sum(is.na(mdb$personal.gender))
## [1] 0
3.7. Make a final data frame of the Bundestag members by converting personal.gender and party to factors (makes further analysis easier).
mdb <- mutate(mdb, personal.gender = as.factor(personal.gender), party = as.factor(party))
ratio_twitter_party <- group_by(mdb, party) %>%
summarise(n_members = n(),
n_twitter = sum(!is.na(twitter_name)),
ratio_twitter = n_twitter / n_members)
ratio_twitter_party
## # A tibble: 7 x 4
## party n_members n_twitter ratio_twitter
## <fct> <int> <int> <dbl>
## 1 AfD 92 54 0.587
## 2 CDU 199 73 0.367
## 3 CSU 46 17 0.370
## 4 DIE GRÜNEN 67 56 0.836
## 5 DIE LINKE 69 49 0.710
## 6 FDP 80 46 0.575
## 7 SPD 152 99 0.651
ggplot(ratio_twitter_party, aes(x = party, y = ratio_twitter)) + geom_col()
Subset with Twitter users:
mdb_twitter <- filter(mdb, !is.na(twitter_name))
nrow(mdb_twitter)
## [1] 394
From now on, work only with the subset that you created previously (Bundestag members with Twitter account), because we want to fetch some metadata from Twitter. If you want to do that task, you will need to install the package rtweet and load it. You also need a Twitter account (you don’t have to tweet anything, though!) and need to set up an “App” at developer.twitter.com. You can then find the four authentication keys that you need for accessing the Twitter API in the “Keys and tokens” tab of the app that you created (as shown in the tutorial).
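The four keys are secrets and should not end up in a script that you share. One possible way to handle that, sketched here under the assumption that you have stored the keys in environment variables, is a small helper that reads them and complains if one is missing. The function name read_twitter_keys() and the variable names such as TWITTER_CONSUMER_KEY are hypothetical; choose whatever fits your setup.

```r
# Read the four Twitter keys from environment variables.
# The variable names are assumptions made for this sketch.
read_twitter_keys <- function() {
  keys <- c(consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
            consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
            access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
            access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET"))

  # Sys.getenv() returns "" for variables that are not set
  missing_keys <- names(keys)[keys == ""]
  if (length(missing_keys) > 0) {
    stop("Missing Twitter keys: ", paste(missing_keys, collapse = ", "))
  }
  keys
}
```

The returned named vector could then be used as keys[["consumer_key"]] and so on when creating the access token.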
5.1. After that, use the keys to create an access token with create_token(). In order to check if it works, look up data for a single Twitter account, e.g. “WZB_Berlin”, using the function lookup_users(). Investigate the result in terms of which data is returned. What could be interesting for us?
library(rtweet)
source('twitterkeys.R')
token <- create_token(
app = "WZBAnalysis",
consumer_key = consumer_key,
consumer_secret = consumer_secret,
access_token = access_token,
access_secret = access_secret)
lookup_users('WZB_Berlin')
## # A tibble: 1 x 88
## user_id status_id created_at screen_name text source
## * <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 459107… 10732269… 2018-12-13 14:43:32 WZB_Berlin "@Pe… Tweet…
## # ... with 82 more variables: display_text_width <int>,
## # reply_to_status_id <lgl>, reply_to_user_id <lgl>,
## # reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, hashtags <list>,
## # symbols <list>, urls_url <list>, urls_t.co <list>,
## # urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## # media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## # ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>,
## # mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## # quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## # quoted_favorite_count <int>, quoted_retweet_count <int>,
## # quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## # quoted_followers_count <int>, quoted_friends_count <int>,
## # quoted_statuses_count <int>, quoted_location <chr>,
## # quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>,
## # retweet_created_at <dttm>, retweet_source <chr>,
## # retweet_favorite_count <int>, retweet_retweet_count <int>,
## # retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>,
## # country <chr>, country_code <chr>, geo_coords <list>,
## # coords_coords <list>, bbox_coords <list>, status_url <chr>,
## # name <chr>, location <chr>, description <chr>, url <chr>,
## # protected <lgl>, followers_count <int>, friends_count <int>,
## # listed_count <int>, statuses_count <int>, favourites_count <int>,
## # account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## # profile_expanded_url <chr>, account_lang <chr>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>
A data frame with 88 variables is returned. It also includes the latest tweet of this account and some metadata for that tweet. However, what is particularly interesting are variables like followers_count, friends_count (the number of accounts this user follows) or statuses_count (the number of tweets sent since the account was created; the creation date is stored in account_created_at).
5.2. Now look up user data for all Twitter account names in the Bundestag members data set (note that this will take several seconds) and store the result.
userdata <- lookup_users(mdb_twitter$twitter_name)
userdata
## # A tibble: 378 x 88
## user_id status_id created_at screen_name text source
## * <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 815469… 10735648… 2018-12-14 13:06:32 DirkWiese4 Inte… Twitt…
## 2 992808… 10729332… 2018-12-12 19:16:28 GabiWeberS… Blic… Faceb…
## 3 386459… 10731567… 2018-12-13 10:04:47 edrossmann @Yan… Twitt…
## 4 146199… 46706835… 2014-05-15 22:25:47 SusannRuet… mit … Twitt…
## 5 217916… 10732029… 2018-12-13 13:08:26 swenschulz "Sie… Twitt…
## 6 111774… 10732633… 2018-12-13 17:08:17 Achim_P Im I… Twitt…
## 7 494500… 10735656… 2018-12-14 13:09:43 DrAndreasN… Pein… Twitt…
## 8 862714… 10731539… 2018-12-13 09:53:45 FOstendorff "...… Twitt…
## 9 257461… 10735546… 2018-12-14 12:26:00 SabineLeid… http… Hoots…
## 10 190691… 10716930… 2018-12-09 09:08:33 Dr_Roy_Kue… @Ilm… Twitt…
## # ... with 368 more rows, and 82 more variables: display_text_width <int>,
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, hashtags <list>,
## # symbols <list>, urls_url <list>, urls_t.co <list>,
## # urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## # media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## # ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>,
## # mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## # quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## # quoted_favorite_count <int>, quoted_retweet_count <int>,
## # quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## # quoted_followers_count <int>, quoted_friends_count <int>,
## # quoted_statuses_count <int>, quoted_location <chr>,
## # quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>,
## # retweet_created_at <dttm>, retweet_source <chr>,
## # retweet_favorite_count <int>, retweet_retweet_count <int>,
## # retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>,
## # country <chr>, country_code <chr>, geo_coords <list>,
## # coords_coords <list>, bbox_coords <list>, status_url <chr>,
## # name <chr>, location <chr>, description <chr>, url <chr>,
## # protected <lgl>, followers_count <int>, friends_count <int>,
## # listed_count <int>, statuses_count <int>, favourites_count <int>,
## # account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## # profile_expanded_url <chr>, account_lang <chr>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>
5.3. Create a subset of the Twitter user data with only the variables screen_name, account_created_at, followers_count, statuses_count. If you wanted to join this data with the Bundestag members data, which variable would you use to match the observations?
twitter_name from the Bundestag members and screen_name from the Twitter API data must match for joining. We must watch out that both are all lower case (twitter_name already is).
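A tiny sketch of why the case matters, using two account names in the spelling that appears in the outputs of this exercise: string comparison in R is case-sensitive, so the names only match after tolower().

```r
twitter_name <- c("dirkwiese4", "achim_p")   # as stored in the members data
screen_name  <- c("DirkWiese4", "Achim_P")   # as returned by the Twitter API

# Without normalisation nothing matches
matches_raw <- twitter_name %in% screen_name

# After lower-casing the API names, both match
matches_lower <- twitter_name %in% tolower(screen_name)

matches_raw
matches_lower
```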
5.4. Transform the data in the subset using mutate() so that 1) the screen_name is always lower case; 2) you have a new variable days_since_creation containing the “age” of the account in days calculated with round(as.numeric(as.POSIXct('2018-12-13') - account_created_at)); 3) another new variable tweets_per_day that is the average number of tweets sent per day by this account (calculate it with the help of days_since_creation). We can then use tweets_per_day as a measure of activity on Twitter.
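A note on the subtraction: subtracting two date-times gives a difftime object whose unit (seconds, hours, days, ...) is chosen automatically. That works out as days here because all accounts are much older than a day, but a slightly more defensive variant, sketched below with two creation dates and tweet counts copied from the output of this exercise, fixes the unit explicitly with difftime(). The time zone "UTC" is an assumption made to keep the sketch reproducible.

```r
# Date of data collection and two account creation times (taken from the output)
collected <- as.POSIXct("2018-12-13", tz = "UTC")
created   <- as.POSIXct(c("2017-01-01 08:06:30", "2012-12-06 10:29:09"), tz = "UTC")

# Account age in days, with the unit stated explicitly
days_since_creation <- round(as.numeric(difftime(collected, created, units = "days")))

# Average number of tweets per day since account creation
statuses_count <- c(3065, 2776)
tweets_per_day <- statuses_count / days_since_creation

days_since_creation
tweets_per_day
```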
usersub <- select(userdata, screen_name, account_created_at, followers_count, statuses_count) %>% # subset
# create the new variables
mutate(screen_name = tolower(screen_name),
# 2018-12-13 is the date the data was collected
days_since_creation = round(as.numeric(as.POSIXct('2018-12-13') - account_created_at)),
tweets_per_day = statuses_count / days_since_creation)
usersub
## # A tibble: 378 x 6
## screen_name account_created_at followers_count statuses_count
## <chr> <dttm> <int> <int>
## 1 dirkwiese4 2017-01-01 08:06:30 1643 3065
## 2 gabiwebers… 2012-12-06 10:29:09 3224 2776
## 3 edrossmann 2009-05-08 11:17:22 2360 590
## 4 susannruet… 2013-05-27 12:42:27 620 19
## 5 swenschulz 2009-02-24 20:14:10 3724 1136
## 6 achim_p 2013-01-24 21:12:00 3342 5082
## 7 drandreasn… 2009-06-21 22:26:05 3970 19374
## 8 fostendorff 2012-10-05 08:17:20 3148 1021
## 9 sabineleid… 2011-02-25 13:33:26 6970 6957
## 10 dr_roy_kue… 2013-09-26 06:13:48 1523 529
## # ... with 368 more rows, and 2 more variables: days_since_creation <dbl>,
## # tweets_per_day <dbl>
5.5. Join the data of the Bundestag members with Twitter account and the data fetched from the Twitter API by matching twitter_name with screen_name. Check where the matching failed (i.e. because no data could be fetched for that account from Twitter).
mdb_twitterstats <- left_join(mdb_twitter, usersub, by = c("twitter_name" = "screen_name"))
mdb_twitterstats
## personal.first_name personal.last_name personal.gender
## 1 Dirk Wiese male
## 2 Gabi Weber female
## 3 Ernst Dieter Rossmann male
## 4 Susann Rüthrich female
## 5 Swen Schulz male
## 6 Achim Post male
## 7 Detlev Pilger male
## 8 Andreas Nick male
## 9 Friedrich Ostendorff male
## 10 Sabine Leidig female
## 11 Roy Kühne male
## 12 Christian Kühn male
## 13 Maik Beermann male
## personal.birthyear party twitter_name account_created_at
## 1 1983 SPD dirkwiese4 2017-01-01 08:06:30
## 2 1955 SPD gabiweberspd 2012-12-06 10:29:09
## 3 1951 SPD edrossmann 2009-05-08 11:17:22
## 4 1977 SPD susannruethrich 2013-05-27 12:42:27
## 5 1968 SPD swenschulz 2009-02-24 20:14:10
## 6 1959 SPD achim_p 2013-01-24 21:12:00
## 7 1955 SPD detlevpilger <NA>
## 8 1967 CDU drandreasnick 2009-06-21 22:26:05
## 9 1953 DIE GRÜNEN fostendorff 2012-10-05 08:17:20
## 10 1961 DIE LINKE sabineleidig 2011-02-25 13:33:26
## 11 1967 CDU dr_roy_kuehne 2013-09-26 06:13:48
## 12 1979 DIE GRÜNEN chriskuehn_mdb 2013-02-02 18:49:01
## 13 1981 CDU maikbeermann 2012-05-02 17:01:13
## followers_count statuses_count days_since_creation tweets_per_day
## 1 1643 3065 711 4.310830e+00
## 2 3224 2776 2198 1.262966e+00
## 3 2360 590 3505 1.683310e-01
## 4 620 19 2025 9.382716e-03
## 5 3724 1136 3578 3.174958e-01
## 6 3342 5082 2148 2.365922e+00
## 7 NA NA NA NA
## 8 3970 19374 3461 5.597804e+00
## 9 3148 1021 2260 4.517699e-01
## 10 6970 6957 2847 2.443625e+00
## 11 1523 529 1904 2.778361e-01
## 12 2273 1759 2139 8.223469e-01
## 13 1860 2466 2415 1.021118e+00
## [ reached getOption("max.print") -- omitted 381 rows ]
Find out how often the matching failed:
sum(is.na(mdb_twitterstats$followers_count))
## [1] 16
And where it did:
filter(mdb_twitterstats, is.na(followers_count))
## personal.first_name personal.last_name personal.gender
## 1 Detlev Pilger male
## 2 Andreas Rimkus male
## 3 Matthias Büttner male
## 4 Sandra Weeser female
## 5 Stephan Protschka male
## 6 Doris Achelwilm female
## 7 Bernd Reuther male
## 8 Manfred Grund male
## 9 Johannes Schraps male
## 10 Eckhardt Rehberg male
## 11 Gökay Akbulut female
## 12 René Springer male
## 13 Torbjörn Kartes male
## personal.birthyear party twitter_name account_created_at
## 1 1955 SPD detlevpilger <NA>
## 2 1962 SPD spduesseldorf <NA>
## 3 1990 AfD buettnersdl <NA>
## 4 1969 FDP weesersandra <NA>
## 5 1977 AfD protschkastepha <NA>
## 6 1976 DIE LINKE doris_achelwilm <NA>
## 7 1971 FDP reutherbernd <NA>
## 8 1955 CDU manfred_grund <NA>
## 9 1983 SPD jojoschraps <NA>
## 10 1954 CDU eckhardtrehberg <NA>
## 11 1982 DIE LINKE goekayakbulut <NA>
## 12 1979 AfD springerren <NA>
## 13 1979 CDU torbjoernkartes <NA>
## followers_count statuses_count days_since_creation tweets_per_day
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 7 NA NA NA NA
## 8 NA NA NA NA
## 9 NA NA NA NA
## 10 NA NA NA NA
## 11 NA NA NA NA
## 12 NA NA NA NA
## 13 NA NA NA NA
## [ reached getOption("max.print") -- omitted 3 rows ]
Maybe those accounts were suspended or are set to “private”. We can exclude them:
mdb_twitterstats <- filter(mdb_twitterstats, !is.na(followers_count))
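Another way to find the failed matches, without relying on NA values after the join, is to compare the two sets of names directly. A minimal sketch with a few account names from this exercise (the vectors requested and returned are made-up stand-ins for the real columns):

```r
requested <- c("dirkwiese4", "detlevpilger", "achim_p")  # names we asked for
returned  <- tolower(c("DirkWiese4", "Achim_P"))         # names the API sent back

# Names that were requested but are not in the result
not_found <- setdiff(requested, returned)
not_found
```

With the real data this would be setdiff(mdb_twitter$twitter_name, usersub$screen_name).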
Note: These are just example questions I came up with. You might have had different questions.
Who has the most followers (e.g. make a top 10 list)?
mdb_twitterstats %>% arrange(desc(followers_count)) %>% head(10)
## personal.first_name personal.last_name personal.gender
## 1 Martin Schulz male
## 2 Sahra Wagenknecht female
## 3 Christian Lindner male
## 4 Heiko Maas male
## 5 Sigmar Gabriel male
## 6 Peter Altmaier male
## 7 Peter Tauber male
## 8 Katrin Göring-Eckardt female
## 9 Cem Özdemir male
## 10 Thomas Oppermann male
## personal.birthyear party twitter_name account_created_at
## 1 1955 SPD martinschulz 2008-11-27 10:49:00
## 2 1969 DIE LINKE swagenknecht 2009-06-15 16:33:09
## 3 1979 FDP c_lindner 2010-03-11 16:11:51
## 4 1966 SPD heikomaas 2009-03-13 11:37:00
## 5 1959 SPD sigmargabriel 2012-05-03 09:27:54
## 6 1958 CDU peteraltmaier 2011-09-23 17:00:07
## 7 1974 CDU petertauber 2009-02-28 14:36:49
## 8 1966 DIE GRÜNEN goeringeckardt 2012-07-04 08:28:33
## 9 1965 DIE GRÜNEN cem_oezdemir 2009-01-17 12:30:36
## 10 1954 SPD thomasoppermann 2011-05-04 18:06:59
## followers_count statuses_count days_since_creation tweets_per_day
## 1 696782 4591 3668 1.2516358
## 2 393336 1290 3467 0.3720796
## 3 314876 12580 3198 3.9337086
## 4 299683 5140 3561 1.4434148
## 5 265029 3223 2415 1.3345756
## 6 240866 10546 2637 3.9992416
## 7 190058 22182 3574 6.2064913
## 8 136430 13585 2353 5.7734807
## 9 127532 8191 3616 2.2652102
## 10 127186 2943 2779 1.0590140
Who sends the most tweets per day (again, a top 10 list)?
mdb_twitterstats %>% arrange(desc(tweets_per_day)) %>% head(10)
## personal.first_name personal.last_name personal.gender
## 1 Udo Hemmelgarn male
## 2 Johannes Kahrs male
## 3 Anke Domscheit-Berg female
## 4 Jörg Schneider male
## 5 Saskia Esken female
## 6 Dieter Janecek male
## 7 Renate Künast female
## 8 Stephan Brandner male
## 9 Uwe Schummer male
## 10 Nicola Beer female
## personal.birthyear party twitter_name account_created_at
## 1 1959 AfD udohemmelgarn 2017-04-04 00:44:40
## 2 1963 SPD kahrs 2009-04-27 06:23:08
## 3 1968 DIE LINKE anked 2008-10-02 10:12:46
## 4 1964 AfD schneider_afd 2016-04-12 19:49:25
## 5 1961 SPD eskensaskia 2013-05-12 17:25:05
## 6 1976 DIE GRÜNEN djanecek 2009-06-29 11:02:16
## 7 1955 DIE GRÜNEN renatekuenast 2013-05-13 11:24:44
## 8 1966 AfD stbrandner 2016-03-25 13:46:18
## 9 1957 CDU uweschummer 2011-11-10 12:18:40
## 10 1970 FDP nicolabeerfdp 2017-02-06 15:18:51
## followers_count statuses_count days_since_creation tweets_per_day
## 1 4815 27717 618 44.849515
## 2 20037 97517 3517 27.727324
## 3 28500 85058 3724 22.840494
## 4 3171 14075 974 14.450719
## 5 6608 23954 2040 11.742157
## 6 8663 38457 3453 11.137272
## 7 41425 21876 2039 10.728789
## 8 4617 10398 992 10.481855
## 9 7802 26701 2589 10.313248
## 10 9744 6219 674 9.227003
What does the distribution of follower counts look like?
With log10 scale on x:
qplot(followers_count, data = mdb_twitterstats, geom = 'density') + scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1 rows containing non-finite values (stat_density).
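The two warnings are what you would expect if one account has zero followers: the logarithm of zero is minus infinity, and ggplot2 drops that value. This is an inference from the warning text, not something checked against the data. The sketch below shows the effect on a made-up vector and one common workaround, adding 1 before taking the logarithm.

```r
followers <- c(0, 12, 1643, 696782)   # made-up values, one of them zero

log_followers <- log10(followers)     # first element is -Inf
n_dropped <- sum(!is.finite(log_followers))

# Workaround: shift by one so that zero maps to zero
log_followers_shifted <- log10(followers + 1)

log_followers
n_dropped
log_followers_shifted
```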
What are the mean and the median follower count per party?
followercounts_party <- group_by(mdb_twitterstats, party) %>%
summarise(mean_followers = mean(followers_count),
median_followers = median(followers_count))
followercounts_party
## # A tibble: 7 x 3
## party mean_followers median_followers
## <fct> <dbl> <dbl>
## 1 AfD 4868. 2338.
## 2 CDU 9858. 2102
## 3 CSU 9120. 1400
## 4 DIE GRÜNEN 13253. 4889
## 5 DIE LINKE 17242. 4613
## 6 FDP 10335. 1507
## 7 SPD 20488. 2951
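The means are several times larger than the medians in every party, which is typical for strongly skewed data: a few accounts with very many followers pull the mean up, while the median barely moves. A small sketch with five follower counts (picked for illustration from numbers appearing in the outputs of this exercise, not a real party group):

```r
# Four ordinary accounts and one with a very large following
followers <- c(1400, 2102, 2951, 4889, 696782)

mean_followers   <- mean(followers)    # dominated by the largest value
median_followers <- median(followers)  # the middle value, robust to the outlier

mean_followers
median_followers
```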
With log10 scale on y:
ggplot(mdb_twitterstats, aes(x = party, y = followers_count)) + geom_boxplot() + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
How do follower counts differ between men and women in each party?
followercounts_gender <- group_by(mdb_twitterstats, party, personal.gender) %>%
summarise(mean_followers = mean(followers_count), median_followers = median(followers_count))
followercounts_gender
## # A tibble: 14 x 4
## # Groups: party [?]
## party personal.gender mean_followers median_followers
## <fct> <fct> <dbl> <dbl>
## 1 AfD female 19701. 6902.
## 2 AfD male 2846. 2212.
## 3 CDU female 2621. 1899
## 4 CDU male 11509. 2171
## 5 CSU female 15108 1453
## 6 CSU male 5854. 1400
## 7 DIE GRÜNEN female 10865. 4889
## 8 DIE GRÜNEN male 16437. 4904.
## 9 DIE LINKE female 27521. 5501
## 10 DIE LINKE male 7391. 4088.
## 11 FDP female 2959. 1317
## 12 FDP male 12793. 1576
## 13 SPD female 4432. 2272
## 14 SPD male 28775. 3185
ggplot(mdb_twitterstats, aes(x = personal.gender, y = followers_count)) +
geom_boxplot() +
scale_y_log10() +
facet_grid(~ party)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
What does the distribution of tweets per day look like, and how does this measure vary between the parties?
qplot(tweets_per_day, data = mdb_twitterstats, geom = 'density')
group_by(mdb_twitterstats, party) %>%
summarise(mean_tweets_per_day = mean(tweets_per_day),
median_tweets_per_day = median(tweets_per_day))
## # A tibble: 7 x 3
## party mean_tweets_per_day median_tweets_per_day
## <fct> <dbl> <dbl>
## 1 AfD 2.61 0.771
## 2 CDU 1.38 0.447
## 3 CSU 0.895 0.370
## 4 DIE GRÜNEN 2.68 2.19
## 5 DIE LINKE 2.14 1.35
## 6 FDP 1.49 0.726
## 7 SPD 1.66 0.862
ggplot(mdb_twitterstats, aes(x = party, y = tweets_per_day)) + geom_boxplot() + ylim(0, 15)
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
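Note what the warning means: ylim(0, 15) does not just zoom in, it removes the three observations above 15 before the box plot statistics are computed, so the boxes are based on fewer data points. The base R sketch below shows the effect on made-up values. In ggplot2, coord_cartesian(ylim = c(0, 15)) would be the zoom-only alternative that keeps all observations.

```r
# Made-up tweets-per-day values; three of them are above 15
tweets_per_day <- c(0.2, 0.5, 0.9, 1.4, 2.6, 10.7, 22.8, 27.7, 44.8)

n_removed <- sum(tweets_per_day > 15)

median_all  <- median(tweets_per_day)                        # all observations
median_kept <- median(tweets_per_day[tweets_per_day <= 15])  # after dropping

n_removed
median_all
median_kept
```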