Load text data into a tm corpus structure using VCorpus and the respective data source function (like DirSource or DataframeSource). Inspect the corpus and a document in it.
You can work with any textual data you like. I provide two data sets in 08textmining-resources.zip, which can be downloaded from the tutorial website:
election_ger17_manifestos.RDS: Party manifestos for the German Bundestag election 2017 from WZB’s Manifesto Project, stored as a data frame in an RDS file (load with readRDS()).
nltk_europarl: Folder with a sample of the European Parliament Proceedings Parallel Corpus.
Hint: If you want to read a data frame into a VCorpus, you need to prepare the data frame so that it has the correct names and order of columns. See the documentation in ?DataframeSource for how the data frame must be formatted.
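As a minimal sketch of the hint above, the following uses a small made-up data frame instead of the provided manifesto data. The column names party and content are assumptions for illustration only; check the real column names after loading the RDS file. According to ?DataframeSource, the first column must be named doc_id and the second text; any further columns are kept as document metadata.

```r
library(tm)

# Hypothetical toy data; the real data would come from readRDS() with your own path.
df <- data.frame(
  party = c("a", "b"),
  content = c("We want more trains.", "We want lower taxes."),
  stringsAsFactors = FALSE
)

# DataframeSource expects 'doc_id' as first and 'text' as second column.
df_corp <- data.frame(
  doc_id = df$party,
  text = df$content,
  stringsAsFactors = FALSE
)

corp <- VCorpus(DataframeSource(df_corp))

inspect(corp)            # overview of the whole corpus
as.character(corp[[1]])  # text of the first document
```

For a folder of text files such as nltk_europarl, DirSource() with the folder path would take the place of DataframeSource().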
Inspect the results (with findMostFreqTerms() and other functions) and compare. Focus especially on the difference between applying and not applying stopword removal.
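One possible way to set up such a comparison is sketched below with a made-up two-document corpus. The texts and object names are assumptions for illustration; the control option stopwords = TRUE asks tm to remove the stopword list for the document language (English by default).

```r
library(tm)

# Hypothetical toy corpus standing in for your own data.
docs <- c("the cat and the dog", "the dog and the bird and the fish")
corp <- VCorpus(VectorSource(docs))

# DTM without and with stopword removal.
dtm_raw <- DocumentTermMatrix(corp)
dtm_nostop <- DocumentTermMatrix(corp, control = list(stopwords = TRUE))

# Most frequent terms per document, for both variants.
findMostFreqTerms(dtm_raw, n = 3)
findMostFreqTerms(dtm_nostop, n = 3)
```

In the first result, words such as "the" and "and" should dominate; in the second, they are no longer part of the vocabulary.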
Inspect the results like before. Do stopwords still appear in the list of most frequent terms?
You can plot a “heatmap” to aid the eye. I’ve created a function plot_dtm_heatmap() in the script file plot_heatmaps.R contained in 08textmining-resources.zip. It plots a matrix as a heatmap. Load the function with source('path/to/plot_heatmaps.R'). Analyse the word usage using absolute word counts and word proportions.
Hint: You can convert a DTM to a standard R matrix via as.matrix(). Then, you can subset the matrix for certain words using column-wise indexing like mat_dtm1[, c('example', 'term')].
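The conversion and subsetting from the hint could look like the sketch below. The toy documents and the name mat_prop1 are assumptions for illustration. Word proportions are obtained here by dividing each row by the total word count of that document.

```r
library(tm)

# Hypothetical toy corpus standing in for your own documents.
docs <- c("example term term", "example example other words")
corp <- VCorpus(VectorSource(docs))
dtm1 <- DocumentTermMatrix(corp)

# Convert the sparse DTM to a standard R matrix (documents in rows, terms in columns).
mat_dtm1 <- as.matrix(dtm1)

# Absolute counts for selected words.
mat_dtm1[, c("example", "term")]

# Word proportions: divide each row by its row sum.
mat_prop1 <- mat_dtm1 / rowSums(mat_dtm1)
mat_prop1[, c("example", "term")]
```

Either matrix can then be passed to a heatmap function that accepts a plain numeric matrix.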
Compute the distances between the documents with the dist() function. Use a DTM with absolute word counts first and then compare with a DTM with word proportions.
I’ve created a function plot_dist_heatmap() in the script file plot_heatmaps.R, which you can use to draw a heatmap from the distance matrix.
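To illustrate why the two variants can give different results, here is a small sketch with a hand-made count matrix that stands in for the output of as.matrix() on a DTM. All values and names are made up for illustration. dist() computes Euclidean distances between rows by default.

```r
# Hand-made count matrix standing in for as.matrix() of a DTM (hypothetical data).
mat_counts <- matrix(
  c(1, 1, 0,
    2, 2, 0,
    0, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("cat", "dog", "fish"))
)

# Word proportions: each row divided by its total word count.
mat_props <- mat_counts / rowSums(mat_counts)

dist_counts <- dist(mat_counts)  # distances based on absolute counts
dist_props <- dist(mat_props)    # distances based on proportions

dist_counts
dist_props
```

Here doc2 uses the same words as doc1, only twice as often. With absolute counts the two documents are a nonzero distance apart, while with proportions their distance is zero, because document length no longer plays a role.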