1. Load text data into a tm corpus structure using VCorpus and the appropriate data source function (such as DirSource or DataframeSource).

Inspect the corpus and a document in it.

You can work with any textual data you like. I provide two data sets in 08textmining-resources.zip, which can be downloaded from the tutorial website.

Hint: If you want to read a data frame into a VCorpus, you need to prepare the data frame so that its columns have the correct names and order. See the documentation in ?DataframeSource for how the data frame must be formatted.
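
A minimal sketch of both loading paths; the file path and the toy data frame are placeholders for illustration, not part of the provided data sets:

    library(tm)

    # Load all plain-text files from a folder into a corpus. The path is
    # a placeholder for wherever you unpacked 08textmining-resources.zip.
    corp <- VCorpus(DirSource('path/to/text/files', encoding = 'UTF-8'))

    # Load a data frame: DataframeSource() expects the columns doc_id and
    # text, in that order (see ?DataframeSource). The toy data here is
    # made up for illustration.
    df <- data.frame(doc_id = c('doc1', 'doc2'),
                     text   = c('First example document.',
                                'Second example document.'),
                     stringsAsFactors = FALSE)
    corp <- VCorpus(DataframeSource(df))

    # Inspect the whole corpus and a single document in it.
    inspect(corp)
    inspect(corp[[1]])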

2. Normalize the corpus using different text transformations (lower case transformation, removal of punctuation, stemming, etc.).

Inspect the results (with findMostFreqTerms() and other functions) and compare. Focus especially on the difference between applying and not applying stopword removal.
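
One possible normalization pipeline, assuming corp is the corpus from exercise 1 (the SnowballC package is needed for stemming):

    library(tm)
    library(SnowballC)  # required by stemDocument()

    # tolower() is a base R function, so it must be wrapped in
    # content_transformer(); the tm transformations can be passed directly.
    corp_norm <- tm_map(corp, content_transformer(tolower))
    corp_norm <- tm_map(corp_norm, removePunctuation)
    corp_norm <- tm_map(corp_norm, removeNumbers)
    corp_norm <- tm_map(corp_norm, stripWhitespace)

    # Second variant with stopword removal; done before stemming because
    # stemming would otherwise alter the stopwords themselves.
    corp_stop <- tm_map(corp_norm, removeWords, stopwords('english'))

    corp_norm <- tm_map(corp_norm, stemDocument)
    corp_stop <- tm_map(corp_stop, stemDocument)

    # Compare the most frequent terms with and without stopword removal.
    findMostFreqTerms(DocumentTermMatrix(corp_norm))
    findMostFreqTerms(DocumentTermMatrix(corp_stop))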

3. Create a tf-idf weighted DTM from a normalized corpus for which no stopword removal was applied.

Inspect the results like before. Do stopwords still appear in the list of most frequent terms?
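
A sketch, assuming corp_norm is the normalized corpus without stopword removal from exercise 2:

    # Build a tf-idf weighted DTM instead of the default term-frequency DTM.
    dtm_tfidf <- DocumentTermMatrix(corp_norm,
                                    control = list(weighting = weightTfIdf))

    # Stopwords occur in nearly every document, so their idf, and hence
    # their tf-idf weight, is close to zero; check whether they still
    # show up among the top terms.
    findMostFreqTerms(dtm_tfidf)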

4. Analyse the word usage for selected words (pick them yourself) across documents.

You can plot a “heatmap” to aid the eye. I’ve created a function plot_dtm_heatmap() in the script file plot_heatmaps.R contained in 08textmining-resources.zip, which plots a matrix as a heatmap. Load the function with source('path/to/plot_heatmaps.R'). Analyse the word usage using both absolute word counts and word proportions.

Hint: You can convert a DTM to a standard R matrix via as.matrix(). Then, you can subset the matrix for certain words using column-wise indexing like mat_dtm1[, c('example', 'term')].
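
A sketch of the counts-vs-proportions comparison; dtm is assumed to be an unweighted (raw count) DTM, and 'example' and 'term' stand in for words of your own choosing:

    source('path/to/plot_heatmaps.R')  # provides plot_dtm_heatmap()

    # Dense matrix of absolute word counts, restricted to selected words.
    mat_counts <- as.matrix(dtm)
    sel_counts <- mat_counts[, c('example', 'term')]

    # Word proportions: divide each row by the document's total word count.
    mat_props <- mat_counts / rowSums(mat_counts)
    sel_props <- mat_props[, c('example', 'term')]

    # Plot both selections as heatmaps and compare.
    plot_dtm_heatmap(sel_counts)
    plot_dtm_heatmap(sel_props)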

5. Calculate the Euclidean distance between the documents using the dist() function.

Use a DTM with absolute word counts first and then compare with a DTM with word proportions.

I’ve created a function plot_dist_heatmap() in the script file plot_heatmaps.R, which you can use to draw a heatmap from the distance matrix.
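
A sketch, reusing mat_counts and mat_props from exercise 4; I assume here that plot_dist_heatmap() accepts the distance object returned by dist():

    # Euclidean distances between documents (dist() uses the Euclidean
    # metric by default, treating each matrix row as one document).
    dist_counts <- dist(mat_counts)
    dist_props  <- dist(mat_props)

    # With absolute counts, document length dominates the distances;
    # with proportions, length differences are normalized away.
    plot_dist_heatmap(dist_counts)
    plot_dist_heatmap(dist_props)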