- Review of last week's tasks
- Text as data
- Text mining methods for the Social Sciences
- Matrices and lists in R
- Bag-of-words model
- Practical text mining with the tm package (I)
December 13, 2018
Now online on https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/
Natural language is context-dependent, loosely structured and often ambiguous. This makes extracting structured information hard.
Text mining (TM) or text analytics tries to uncover structured key information from natural language text.
Other important fields:
Text material is compiled into a corpus. This is the data basis for TM and contains a set of documents. Each document has a unique document ID, its raw text and, optionally, metadata.
Documents can be anything: news articles, scientific papers, twitter posts, books, paragraphs of books, speeches, etc.
Usually, you don't mix different sorts of text within a corpus.
A token is the lexical unit you work with during your analysis. This can be phrases, words, symbols, characters, etc.
→ roughly the unit of measurement in your TM project.
Even if you initially use words as lexical unit, a tokenized and processed word might not be a lexicographically correct word anymore.
Example that employs stemming and lower-case transformation:
"I argued with him" → ["i", "argu", "with", "he"]
Tokens are also called terms.
What can you find out with text mining? A few key methods often employed in the Soc. Sciences:
1. Simple & weighted word frequency comparisons
Count the words that occur in each document, calculate proportions, compare.
Weighted frequencies: Increase importance of document-specific words, reduce importance of very common words
→ key concept: term frequency – inverse document frequency (tf-idf).
2. Word co-occurrence and correlation
How often do pairs of words appear together per document?
3. Document classification
Approach:
Examples:
4. Document similarity and clustering
How similar is document A as compared to document B?
Mostly used with word frequencies → compare (weighted) word usage between documents.
Once you have similarity scores for documents, you can cluster them.
Hierarchical clusters of party manifestos for Bundestag election 2017
5. Term similarity and edit-distances
Term similarity works on the level of terms and their (phonetic, lexicographic, etc.) similarity. Edit distances measure the editing difference between two terms or two documents A and B (how many editing steps do you need to get from A to B?).
Example: Levenshtein distance between "kitten" and "sitting" is 3 edits.
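This example is easy to verify in base R, whose adist() function computes a generalized Levenshtein distance:

adist('kitten', 'sitting')

##      [,1]
## [1,]    3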
Practical example: Measure how much drafts of a law changed over time.
6. Topic modeling
Unsupervised machine learning approach to find latent topics in text corpora. Topics are distributions across words. Each document can be represented as a mixture of topics.
Practical example: Measure how the presence of certain topics changed over time in parliamentary debates; differences between parties, etc.
7. Sentiment analysis
Also known as opinion mining. In its basic form, it tries to find out whether the sentiment in a document is positive, neutral or negative by assigning a sentiment score.
This score can be estimated by using supervised machine learning approaches (using training data of already scored documents) or in a lexicon-based manner (adding up the individual sentiment scores for each word in the text).
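A minimal sketch of the lexicon-based variant, using a hypothetical toy lexicon (the scores are made up for illustration):

lexicon <- c(good = 1, great = 2, bad = -1, terrible = -2)   # toy sentiment lexicon
tokens <- c('the', 'debate', 'was', 'great', 'not', 'bad')
sum(lexicon[tokens], na.rm = TRUE)   # 2 + (-1) = 1

Note that this naive version ignores negation ("not bad").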
Named entity recognition: Find company names, people's names, etc. in texts.
Gender prediction: Estimate the gender of a person (for example from a first name).
… and much more
TM consists of several steps, each of them applying a variety of methods:
Which steps and methods you apply depends on your material and the modeling approach.
Specific methods:
The matrix structure stores data in a matrix with \(m\) rows and \(n\) columns. Each value must be of the same data type (type coercion rules apply).
To create a matrix, specify the data and its dimensions:
matrix(1:6, nrow = 2, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
The same data filled into a matrix with three rows and two columns:
matrix(1:6, nrow = 3, ncol = 2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
By default, the data is filled in column-wise order; set byrow = TRUE to fill row-wise:
# fill data in rowwise order
matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
(A <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
The same indexing rules as for data frames apply. Individual cells are selected by [row index, column index]:
A[2, 3]
## [1] 6
Rows are selected by [row index, ]:
A[2,]
## [1] 4 5 6
Columns are selected by [, column index]:
A[,3]
## [1] 3 6
Matrix B with dimensions 3×3:
(B <- matrix(rep(1:3, 3), nrow = 3, ncol = 3, byrow = TRUE))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    2    3
## [3,]    1    2    3
Matrix multiplication:
A %*% B
##      [,1] [,2] [,3]
## [1,]    6   12   18
## [2,]   15   30   45
Matrix C with the same dimensions as A:
(C <- matrix(6:1, nrow = 2, ncol = 3, byrow = TRUE))
##      [,1] [,2] [,3]
## [1,]    6    5    4
## [2,]    3    2    1
Matrix addition:
A + C
##      [,1] [,2] [,3]
## [1,]    7    7    7
## [2,]    7    7    7
Element-wise multiplication:
A * C
##      [,1] [,2] [,3]
## [1,]    6   10   12
## [2,]   12   10    6
Rowwise normalization of A:
rowSums(A)
## [1] 6 15
A / rowSums(A)
##           [,1]      [,2] [,3]
## [1,] 0.1666667 0.3333333  0.5
## [2,] 0.2666667 0.3333333  0.4
Transpose:
t(A)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
As with data frames, row names and column names can optionally be set via rownames() and colnames():
rownames(A) <- c('row1', 'row2')
colnames(A) <- c('col1', 'col2', 'col3')
A
##      col1 col2 col3
## row1    1    2    3
## row2    4    5    6
A['row2',]
## col1 col2 col3 
##    4    5    6
In contrast to vectors and matrices, lists can contain elements of different types:
list(1:3, 'abc', 3.1415, c(FALSE, TRUE, TRUE, FALSE))
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "abc"
## 
## [[3]]
## [1] 3.1415
## 
## [[4]]
## [1] FALSE TRUE TRUE FALSE
You can think of a list as arbitrary "key-value" data structure. For each unique "key" (i.e. index), a list can hold a value of arbitrary type, even another list.
l <- list(a = 1:3, b = 'abc', c = 3.1415,
          d = c(FALSE, TRUE, TRUE, FALSE), e = list(1, 2, 3))
str(l)
## List of 5
##  $ a: int [1:3] 1 2 3
##  $ b: chr "abc"
##  $ c: num 3.14
##  $ d: logi [1:4] FALSE TRUE TRUE FALSE
##  $ e:List of 3
##   ..$ : num 1
##   ..$ : num 2
##   ..$ : num 3
If no key is given, the default keys are set as 1 to N:
(l <- list(1:3, 'abc', 3.1415, c(FALSE, TRUE, TRUE, FALSE)))
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "abc"
## 
## [[3]]
## [1] 3.1415
## 
## [[4]]
## [1] FALSE TRUE TRUE FALSE
Indexing with single square brackets always results in a new list (here, containing only a single element):
l[4]
## [[1]]
## [1] FALSE TRUE TRUE FALSE
Use double square brackets to get the actual element as vector:
l[[4]]
## [1] FALSE TRUE TRUE FALSE
For a named list, the same rules for single and double square brackets apply; elements can be selected by their key:
l['d']
## $d
## [1] FALSE TRUE TRUE FALSE
l[['d']]
## [1] FALSE TRUE TRUE FALSE
A shortcut to access elements in a list by key is the dollar symbol:
l$d # same as l[['d']]
## [1] FALSE TRUE TRUE FALSE
Bag-of-words is a simple but powerful representation of a text corpus.
Three documents:
doc_id | text |
---|---|
1 | Smithers, release the hounds. |
2 | Smithers, unleash the League of Evil! |
3 | The evil Evil of the most Evil. |
The resulting DTM with normalized words:

doc_id | evil | hounds | league | most | release | smithers | the | unleash |
---|---|---|---|---|---|---|---|---|
1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
2 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
3 | 3 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
So far, we've used unigrams. Each word ("term") is counted individually.
We can also count combinations of \(n\) consecutive words (n-grams):
"Smithers, release the hounds."
→ as bigrams (2-grams):
["smithers release", "release the", "the hounds"]
Bigrams for the three example documents:

1 → ["smithers release", "release the", "the hounds"]
2 → ["smithers unleash", "unleash the", "the league", "league of", "of evil"]
3 → ["the evil", "evil evil", "evil of", "of the", "the most", "most evil"]
Problem with BoW: common (uninformative) words (e.g. "the, a, and, or, …") that occur often in many documents overshadow more specific (potentially more interesting) words.
Solutions:
Tf-idf (term frequency – inverse document frequency) is such a weighting factor.
For each term \(t\) in each document \(d\) in a corpus of all documents \(D\), the \(\text{tfidf}\) weighting factor is calculated as product of two factors:
\[ \text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D) \]
There are different weighting variants for both factors.
Common variants for the term frequency \(\text{tf}(t, d)\): raw counts, counts normalized by document length (relative frequencies), log-scaled counts, or a binary indicator (term occurs / does not occur).
Again, many variants. We'll use this one:
\[ \text{idf}(t, D) = \log_2 \left(1 + \frac{|D|}{|d \in D: t \in d|}\right) \]
Calculate \(|d \in D: t \in d|\) (the number of documents \(d\) in which \(t\) appears) for all terms:
##     evil   hounds   league     most  release smithers      the  unleash 
##        2        1        1        1        1        2        3        1
Plugging this into the above formula yields the \(\text{idf}\) value for all terms:
##     evil   hounds   league     most  release smithers      the  unleash 
##     1.32     2.00     2.00     2.00     2.00     1.32     1.00     2.00
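These values can be reproduced in a few lines of R (a minimal sketch with the document frequencies hard-coded from above):

df <- c(evil = 2, hounds = 1, league = 1, most = 1,
        release = 1, smithers = 2, the = 3, unleash = 1)
round(log2(1 + 3 / df), 2)   # |D| = 3 documents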
This factor is multiplied with each term frequency → the more common a word is in the corpus, the lower its \(\text{idf}\) value.
The distribution of words in a natural language text usually follows the "Zipfian distribution", which relates to Zipf's law:
Zipf’s law states that the frequency that a word appears is inversely proportional to its rank. – Silge & Robinson 2017
\[\text{frequency} \propto r^{-1}\]
→ the second most frequent word occurs half as often as the most frequent word; the third most frequent word occurs a third as often, etc.
To account for that, we use logarithmic values (hence the \(\log_2\) in the \(\text{idf}\) formula).
Back to the initial formula:
\[ \text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D) \]
Result after matrix multiplication between the \(\text{tf}\) matrix and the diagonal matrix of the \(\text{idf}\) values:

doc_id | evil | hounds | league | most | release | smithers | the | unleash |
---|---|---|---|---|---|---|---|---|
1 | 0 | 2.00 | 0 | 0 | 2.00 | 1.32 | 1.00 | 0 |
2 | 1.32 | 0 | 2.00 | 0 | 0 | 1.32 | 1.00 | 2.00 |
3 | 3.96 | 0 | 0 | 2.00 | 0 | 0 | 2.00 | 0 |
→ uncommon (i.e. more specific) words get higher weight (e.g. "hounds" or "league")
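The whole calculation fits into a few lines of base R (a minimal sketch; tf holds the term counts of the example DTM from above):

tf <- matrix(c(0, 1, 0, 0, 1, 1, 1, 0,
               1, 0, 1, 0, 0, 1, 1, 1,
               3, 0, 0, 1, 0, 0, 2, 0), nrow = 3, byrow = TRUE)
colnames(tf) <- c('evil', 'hounds', 'league', 'most',
                  'release', 'smithers', 'the', 'unleash')
idf <- log2(1 + nrow(tf) / colSums(tf > 0))   # idf as defined above
tfidf <- tf %*% diag(idf)                     # product with the diagonal idf matrix
colnames(tfidf) <- colnames(tf)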
Once we have a DTM, we can consider each document as a vector across terms (each row in a DTM is a vector of size \(N_{terms}\)).
E.g. document #3 has the following term count vector:
##          [,1]
## evil        3
## hounds      0
## league      0
## most        1
## release     0
## smithers    0
## the         2
## unleash     0
In machine learning terminology this is a feature vector. We can use these features for example for document classification, document similarity, document clustering, etc.
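For document similarity, for instance, a common measure is the cosine similarity between two such vectors. A minimal sketch (cosine_sim is a hypothetical helper, not part of any package used here):

# cosine similarity between two document term vectors
cosine_sim <- function(d1, d2) sum(d1 * d2) / sqrt(sum(d1^2) * sum(d2^2))
cosine_sim(c(0, 1, 0, 0, 1, 1, 1, 0),   # term counts of document #1
           c(3, 0, 0, 1, 0, 0, 2, 0))   # term counts of document #3

## [1] 0.2672612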
Most packages, tutorials, etc. are designed for English language texts. When you work with other languages, you may need to apply other methods for text preprocessing. For example, working with German texts might require proper lemmatization to bring words from their inflected form to their base form (e.g. "geschlossen" → "schließen").
Practical text mining with the tm package

Resources to start:
I will demonstrate how to use the package to investigate word frequency and document similarity.
A corpus contains the raw text for each document (identified by a document ID).
The base class is VCorpus, which can be initialized with a data source.
Read plain text files from a directory:
corpus <- VCorpus(DirSource('path/to/documents', encoding = 'UTF-8'),
                  readerControl = list(language = 'de'))   # default language is 'en'
- encoding specifies the text format → important for special characters (like German umlauts)

A data frame can be converted to a corpus, too. It must contain at least the columns doc_id and text:
df_texts
##   doc_id text                                           date 
##   <chr>  <chr>                                          <chr>
## 1 Grüne  "A. EINLEITUNG\nLiebe Bürgerinnen und Bürger,… 2017…
## 2 Linke  "Die Zukunft, für die wir kämpfen: SOZIAL. GE… 2017…
## 3 SPD    "Es ist Zeit für mehr Gerechtigkeit!\n2017 is… 2017…
corpus <- VCorpus(DataframeSource(df_texts))
We load a sample of the European Parliament Proceedings Parallel Corpus with English texts. If you want to follow along, download "08textmining-resource.zip" from the tutorial website.
library(tm)
europarl <- VCorpus(DirSource('08textmining-resources/nltk_europarl'))
europarl
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 10
inspect() returns information on corpora and documents:
inspect(europarl)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 10
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 145780
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 554441
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 228141
## 
## [[4]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 559
## 
## [[5]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 314931
## 
## [[6]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 147766
## 
## [[7]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 170580
## 
## [[8]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 565922
## 
## [[9]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 539764
## 
## [[10]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 372125
Information for the fourth document:
inspect(europarl[[4]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 559
## 
## Adoption of the Minutes of the previous sitting Mr President , I simply wanted to pass on some news .
## There was a terrorist attack this morning in Madrid .
## Someone planted a car bomb and one person has died .
## On behalf of my Group , I once again condemn these terrorist acts .
## Thank you , Mrs Fraga Estévez .
## We had heard about this regrettable incident .
## Unfortunately , the terrorist murderers are once again punishing Spanish society .
## I note your comments with particular keenness , as you may expect , given that I too am Spanish .
## ( The Minutes were approved )
Get the raw text of a document with content():
head(content(europarl[[1]]))
## [1] " " ## [2] "Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period ." ## [3] "Although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful ." ## [4] "You have requested a debate on this subject in the course of the next few days , during this part-session ." ## [5] "In the meantime , I should like to observe a minute ' s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the European Union ." ## [6] "Please rise , then , for this minute ' s silence ."
We want to investigate word frequencies in our corpus. To count words, we need to transform raw text into a normalized sequence of tokens.
Why normalize text? Consider these documents:
Text processing includes many steps and hence many decisions that have a big effect on your results. Several possibilities will be shown here. If and how to apply them depends heavily on your data and your later analysis.
Can you think of an example where unconditional lower-case transformation is bad?
Normalization might involve some of the following steps:
The order is important!
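For example, stop word removal in tm is case-sensitive, so lower-case transformation has to happen before it (a minimal sketch):

library(tm)
removeWords('The hounds and the cats', stopwords('en'))           # "The" survives
## [1] "The hounds   cats"
removeWords(tolower('The hounds and the cats'), stopwords('en'))  # lower-casing first
## [1] " hounds   cats"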
Text normalization can be employed with "transformations" in tm.
Concept:
tm_map(<CORPUS>, content_transformer(<FUNCTION>), <OPTIONAL ARGS>)
<FUNCTION>
can be any function that takes a character vector, transforms it, and returns the result as character vector<OPTIONAL ARGS>
are fixed arguments passed to <FUNCTION>
- tm comes with many predefined transformation functions like removeWords, removePunctuation, stemDocument, …
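A custom transformation can be wrapped in content_transformer() as well; replace_hyphens below is a hypothetical helper, not part of tm:

replace_hyphens <- function(x) gsub('-', ' ', x, fixed = TRUE)   # replace hyphens with spaces
corpus <- tm_map(corpus, content_transformer(replace_hyphens))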
A transformation pipeline applied to our corpus (only showing the first three documents):
Original documents:
##   name                                                  text
## 1    1 Resumption of the session I declare resumed the se...
## 2    2 Adoption of the Minutes of the previous sitting Th...
## 3    3 Middle East peace process ( continuation ) The nex...
europarl <- tm_map(europarl, content_transformer(textclean::replace_contraction)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords('en')) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
After text normalization:
##   name                                                  text
## 1    1 resumption session declare resumed session europea...
## 2    2 adoption minutes previous sitting minutes yesterda...
## 3    3 middle east peace process continuation next item c...
- DocumentTermMatrix() takes a corpus, tokenizes it, and generates the document term matrix (DTM)
- The control argument adjusts the transformation from corpus to DTM
dtm <- DocumentTermMatrix(europarl, control = list(wordLengths = c(2, Inf)))
inspect(dtm)
## <<DocumentTermMatrix (documents: 10, terms: 14118)>>
## Non-/sparse entries: 42920/98260
## Sparsity           : 70%
## Maximal term length: 24
## Weighting          : term frequency (tf)
## Sample             :
##                 Terms
## Docs             also can commission european  mr must parliament
##   ep-00-01-17.en   82  46        130       93 128   53         79
##   ep-00-01-18.en  306 200        692      477 356  316        258
##   ep-00-01-19.en  132 107        104      187 157   99        104
##   ep-00-01-21.en    0   0          0        0   1    0          0
##   ep-00-02-02.en  188 118        194      298 220  157        191
##   ep-00-02-03.en   69  59         36      146  73   68        101
##   ep-00-02-14.en   80  63        126      132  86   75         91
##   ep-00-02-15.en  312 255        562      449 365  375        216
##   ep-00-02-16.en  293 183        260      556 360  179        212
##   ep-00-02-17.en  185 142        184      336 307  215        116
##                 Terms
## Docs             president union will
##   ep-00-01-17.en        89    56   94
##   ep-00-01-18.en       203   169  575
##   ep-00-01-19.en        89   114  284
##   ep-00-01-21.en         1     0    0
##   ep-00-02-02.en       183   199  297
##   ep-00-02-03.en        47    50  113
##   ep-00-02-14.en        90    61  123
##   ep-00-02-15.en       246   232  565
##   ep-00-02-16.en       187   391  484
##   ep-00-02-17.en       178   156  241
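The Weighting line shows that this DTM holds raw term frequencies. If you instead want the tf-idf weighting discussed earlier, tm's weightTfIdf can be passed via the control argument (a sketch, assuming the same corpus):

dtm_tfidf <- DocumentTermMatrix(europarl,
                                control = list(weighting = weightTfIdf,
                                               wordLengths = c(2, Inf)))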
- A tm DTM is a sparse matrix → only values \(\ne 0\) are stored → saves a lot of memory

as.matrix(dtm)[, 1:8]   # cast to an ordinary matrix and see first 8 terms
##                 Terms
## Docs             aan abandon abandoned abandoning abandonment abattoirs
##   ep-00-01-17.en   0       0         0          0           0         0
##   ep-00-01-18.en   0       1         4          0           0         0
##   ep-00-01-19.en   0       1         1          1           0         0
##   ep-00-01-21.en   0       0         0          0           0         0
##   ep-00-02-02.en   0       0         0          0           0         0
##   ep-00-02-03.en   0       1         6          0           0         0
##   ep-00-02-14.en   0       0         1          0           0         0
##   ep-00-02-15.en   1       0         1          0           0         1
##   ep-00-02-16.en   0       0         0          0           0         0
##   ep-00-02-17.en   0       1         6          0           1         0
##                 Terms
## Docs             abb abbalsthom
##   ep-00-01-17.en   0          0
##   ep-00-01-18.en   3          0
##   ep-00-01-19.en   0          0
##   ep-00-01-21.en   0          0
##   ep-00-02-02.en   0          0
##   ep-00-02-03.en   0          0
##   ep-00-02-14.en   0          0
##   ep-00-02-15.en   0          0
##   ep-00-02-16.en   0          0
##   ep-00-02-17.en   0          7
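A quick way to explore such a DTM is findFreqTerms() from tm; the frequency threshold below is arbitrary:

# terms that occur at least 500 times in total across the corpus
findFreqTerms(dtm, lowfreq = 500)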