1. Load text data into a tm corpus structure using VCorpus and the appropriate data source function (such as DirSource or DataframeSource).

Inspect the corpus and a document in it.

You can work with any textual data you like. I provide two data sets in 08textmining-resources.zip, which can be downloaded from the tutorial website.

Hint: If you want to read a data frame into a VCorpus, you need to prepare the data frame so that its columns have the correct names and order. See the documentation in ?DataframeSource for how the data frame must be formatted.
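
A minimal sketch of both loading paths; the file path and the toy data frame are placeholders for illustration, not part of the provided data sets:

    library(tm)

    # Load all plain-text files from a folder into a corpus. The path is
    # a placeholder for wherever you unpacked 08textmining-resources.zip.
    corp <- VCorpus(DirSource('path/to/text/files', encoding = 'UTF-8'))

    # Load a data frame: DataframeSource() expects the columns doc_id and
    # text, in that order (see ?DataframeSource). The toy data here is
    # made up for illustration.
    df <- data.frame(doc_id = c('doc1', 'doc2'),
                     text   = c('First example document.',
                                'Second example document.'),
                     stringsAsFactors = FALSE)
    corp <- VCorpus(DataframeSource(df))

    # Inspect the whole corpus and a single document in it.
    inspect(corp)
    inspect(corp[[1]])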

2. Normalize the corpus using different text transformations (lower case transformation, removal of punctuation, stemming, etc.).

Inspect the results (with findMostFreqTerms() and other functions) and compare. Focus especially on the difference between applying and not applying stopword removal.
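
One possible normalization pipeline, assuming corp is the corpus from exercise 1 (the SnowballC package is needed for stemming):

    library(tm)
    library(SnowballC)  # required by stemDocument()

    # tolower() is a base R function, so it must be wrapped in
    # content_transformer(); the tm transformations can be passed directly.
    corp_norm <- tm_map(corp, content_transformer(tolower))
    corp_norm <- tm_map(corp_norm, removePunctuation)
    corp_norm <- tm_map(corp_norm, removeNumbers)
    corp_norm <- tm_map(corp_norm, stripWhitespace)

    # Second variant with stopword removal; done before stemming because
    # stemming would otherwise alter the stopwords themselves.
    corp_stop <- tm_map(corp_norm, removeWords, stopwords('english'))

    corp_norm <- tm_map(corp_norm, stemDocument)
    corp_stop <- tm_map(corp_stop, stemDocument)

    # Compare the most frequent terms with and without stopword removal.
    findMostFreqTerms(DocumentTermMatrix(corp_norm))
    findMostFreqTerms(DocumentTermMatrix(corp_stop))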

3. Create a tf-idf weighted DTM from a normalized corpus for which no stopword removal was applied.

Inspect the results like before. Do stopwords still appear in the list of most frequent terms?
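
A sketch, assuming corp_norm is the normalized corpus without stopword removal from exercise 2:

    # Build a tf-idf weighted DTM instead of the default term-frequency DTM.
    dtm_tfidf <- DocumentTermMatrix(corp_norm,
                                    control = list(weighting = weightTfIdf))

    # Stopwords occur in nearly every document, so their idf, and hence
    # their tf-idf weight, is close to zero; check whether they still
    # show up among the top terms.
    findMostFreqTerms(dtm_tfidf)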

4. Analyse the word usage for selected words (pick them yourself) across documents.

You can plot a “heatmap” to aid the eye. I’ve created a function plot_dtm_heatmap() in the script file plot_heatmaps.R contained in 08textmining-resources.zip, which plots a matrix as a heatmap. Load the function with source('path/to/plot_heatmaps.R'). Analyse the word usage using both absolute word counts and word proportions.

Hint: You can convert a DTM to a standard R matrix via as.matrix(). Then, you can subset the matrix for certain words using column-wise indexing like mat_dtm1[, c('example', 'term')].
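
A sketch of the counts-vs-proportions comparison; dtm is assumed to be an unweighted (raw count) DTM, and 'example' and 'term' stand in for words of your own choosing:

    source('path/to/plot_heatmaps.R')  # provides plot_dtm_heatmap()

    # Dense matrix of absolute word counts, restricted to selected words.
    mat_counts <- as.matrix(dtm)
    sel_counts <- mat_counts[, c('example', 'term')]

    # Word proportions: divide each row by the document's total word count.
    mat_props <- mat_counts / rowSums(mat_counts)
    sel_props <- mat_props[, c('example', 'term')]

    # Plot both selections as heatmaps and compare.
    plot_dtm_heatmap(sel_counts)
    plot_dtm_heatmap(sel_props)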

5. Calculate the Euclidean distance between the documents using the dist() function.

Use a DTM with absolute word counts first and then compare with a DTM with word proportions.

I’ve created a function plot_dist_heatmap() in the script file plot_heatmaps.R, which you can use to draw a heatmap from the distance matrix.
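
A sketch, reusing mat_counts and mat_props from exercise 4; I assume here that plot_dist_heatmap() accepts the distance object returned by dist():

    # Euclidean distances between documents (dist() uses the Euclidean
    # metric by default, treating each matrix row as one document).
    dist_counts <- dist(mat_counts)
    dist_props  <- dist(mat_props)

    # With absolute counts, document length dominates the distances;
    # with proportions, length differences are normalized away.
    plot_dist_heatmap(dist_counts)
    plot_dist_heatmap(dist_props)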