Spiegel Online news topics and COVID-19

A topic modeling approach

November 25, 2020 | Markus Konrad markus.konrad@wzb.eu | WZB / Berlin Social Science Center

This is the main part of a small project to showcase topic modeling with the tmtoolkit Python package via LDA. I use a corpus of Spiegel Online (SPON) news articles to create a topic model covering the time before and during the COVID-19 pandemic. In the notebook at hand, I load and investigate the generated topic model and identify COVID-19 related topics. I then show how the share of COVID-19 related topics – as a measure of media coverage on COVID-19 – developed during the pandemic and, as a quick example, how this relates to national and global COVID-19 infection rates.

For an introduction to topic modeling via LDA, see Introduction to Probabilistic Topic Models (Blei 2012) or Topic modeling made just simple enough (Underwood 2012).

Currently, the time span from Oct. 2019 to the end of Aug. 2020 is covered, but I plan to provide an update extending it to the end of Nov. 2020. The time span begins well before Jan. 2020 in order to make sure that enough topics are generated that have nothing to do with the pandemic.

The results of other Python scripts in this repository enter this notebook, namely:

  1. text data preparation in prepare.py
  2. topic model evaluation in tm_evaluation.py
  3. generation of final candidate topic models in tm_final.py

Please have a look at the GitHub repository which contains all necessary files.

Data loading

We load a candidate model that was generated in tm_final.py (for this example, we only investigate the first of the candidate models that are stored in 'data/tm_final_results.pickle').
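A minimal sketch of this loading step is shown below. The structure of the pickle file (a list of candidate models) and the model attributes are assumptions; the attributes correspond to models fitted with the lda package, which tmtoolkit supports.

```python
import pickle

# load the candidate models generated in tm_final.py; the pickle structure
# (a list of fitted candidate models) is an assumption here
with open('data/tm_final_results.pickle', 'rb') as f:
    candidate_models = pickle.load(f)

model = candidate_models[0]   # only investigate the first candidate

# models fitted with the "lda" package expose these attributes
doc_topic_distrib = model.doc_topic_      # shape: (n documents, n topics)
topic_word_distrib = model.topic_word_    # shape: (n topics, vocabulary size)

print(doc_topic_distrib.shape)    # (32921, 180)
print(topic_word_distrib.shape)   # (180, 3278)
```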

We see that we have a model with 180 topics. The shapes of the document-topic and topic-word distributions also tell us the number of documents (32,921) and the vocabulary size (3,278) for which the model was generated.

We load the document labels, vocabulary and document-term matrix (DTM) which were generated in prepare.py.
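A sketch of this step, assuming prepare.py pickled the preprocessed data (the file name and dictionary keys are assumptions):

```python
from tmtoolkit.utils import unpickle_file

# hypothetical file name and keys; prepare.py may store the data differently
data = unpickle_file('data/preproc_data.pickle')
doc_labels, vocab, dtm = data['doc_labels'], data['vocab'], data['dtm']

print(len(doc_labels))   # 32921 documents
print(len(vocab))        # 3278 terms
print(dtm.shape)         # (32921, 3278) sparse document-term matrix
```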

The dimensions of the DTM match the number of documents and the vocabulary size.

First ten document labels:

First ten terms in vocabulary:

The DTM is a large sparse matrix with the expected dimensions:

Investigating the topic model

Let's investigate how quickly the probabilities in the topic-word matrix fall with increasing term rank. For each topic (i.e. each row in the topic-word matrix), we sort the term probabilities in descending order and plot these sorted topic-specific term probabilities.
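A minimal matplotlib sketch of such a plot (not necessarily the exact code used for the figure in this notebook):

```python
import numpy as np
import matplotlib.pyplot as plt

# sort each topic's term probabilities in descending order
sorted_probs = -np.sort(-topic_word_distrib, axis=1)

fig, ax = plt.subplots()
ranks = np.arange(1, sorted_probs.shape[1] + 1)
for t in range(sorted_probs.shape[0]):
    # one line per topic; low alpha so overlapping curves stay readable
    ax.plot(ranks, sorted_probs[t, :], color='gray', alpha=0.1)
ax.set_xlim(1, 100)   # only the first ranks are of interest
ax.set_xlabel('term rank')
ax.set_ylabel('topic-specific term probability')
plt.show()
```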

Note: I will add a function for such plots to the next major version of tmtoolkit.

This shows us that if we rank the terms in each topic, only the first five to ten terms really represent the topic. The probabilities of all other terms are so close to each other that their ranking becomes more or less meaningless.

This finding justifies focusing on roughly the top five terms per topic when we later try to interpret the topics.

When we look closely, we see that the ordered probability distributions of three of the 180 topics stand out from the rest. Let's highlight these topics. We want to identify the three topics that stand out in the plot above, i.e. those with the highest term probability at term ranks 1, 7 and 25:
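Using the sorted_probs array from the sketch above, these topics can be found as follows:

```python
# topic with the highest sorted term probability at ranks 1, 7 and 25
# (rank r corresponds to column r - 1)
special_topics = sorted({int(np.argmax(sorted_probs[:, r - 1])) for r in (1, 7, 25)})
print(special_topics)   # expected: [94, 132, 159]
```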

Let's plot these results. This time, we use a log scale for the y-axis to better discern the individual topics:

The highlighted topics are #94, #132 and #159. We will later have a closer look at them.

We will now further investigate the topics and try to identify topics of interest to us, i.e. topics related to the COVID-19 pandemic. Before we continue, we apply a transformation to the topic-word matrix: We use topic-word relevance (Sievert & Shirley 2014) which "helps to identify the most relevant words within a topic by also accounting for the marginal probability of each word across the corpus", i.e. this transformation puts a penalty on terms that are more common (have a higher marginal probability) and therefore pushes more "specific" terms to higher ranks in each topic.
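In tmtoolkit, this transformation is implemented in the model_stats module. A sketch, where the lambda value and the exact parameter names are assumptions and may differ between versions:

```python
from tmtoolkit.bow.bow_stats import doc_lengths
from tmtoolkit.topicmod.model_stats import topic_word_relevance

doc_lens = doc_lengths(dtm)   # number of tokens per document

# lambda_ balances the topic-specific probability against the "lift";
# 0.6 is the value recommended by Sievert & Shirley (2014)
topic_word_rel = topic_word_relevance(topic_word_distrib, doc_topic_distrib,
                                      doc_lens, lambda_=0.6)
```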

We can now try to identify our topics of interest. We can do this computationally, e.g. by using filter_topics() and specifying a list of search terms. This identifies all topics that contain at least one matching term among their top_n=10 terms (remember we found that the top 5 to 10 terms sufficiently characterize a topic). Note that we use the topic-word relevance matrix topic_word_rel here instead of the topic-word matrix, because we want the top terms to be ranked by the mentioned relevance metric instead of the topic-specific term probabilities.
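A sketch of such a keyword-based filter. The keyword list here is hypothetical (the list actually used in this project may differ), and the exact filter_topics() signature may vary between tmtoolkit versions:

```python
from tmtoolkit.topicmod.model_stats import filter_topics

# hypothetical search patterns for COVID-19 related terms (German corpus)
covid_keywords = ['corona*', 'covid*', 'pandemie', 'virus*', 'infektion*']

# topics with at least one matching term among their top 10 terms of the
# relevance-transformed topic-word matrix
covid_topics = filter_topics(covid_keywords, vocab, topic_word_rel,
                             top_n=10, match_type='glob')
print(covid_topics)
```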

An alternative (and you should do this anyway) is to look at the top terms of each topic and try to interpret it. This is still feasible for 180 topics. By doing so, you can also identify COVID-19 related topics that don't contain the keywords above but perhaps synonyms you didn't think of. Furthermore, you make sure that the topics in your topic model make sense. As long as there's only a small fraction of "nonsensical" topics (which happens most of the time with LDA), you know your model is okay and you can identify and later exclude such topics.

There are several ways to display and export topic modeling results, specifically the document-topic and topic-word distributions. See this section in the tmtoolkit documentation for the available options. We will export the results to an Excel file, which makes it easy to investigate topics and mark them accordingly:
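tmtoolkit provides a helper for such an Excel export in its model_io module. A sketch; treat the exact signature as an assumption and check it against the documentation of your tmtoolkit version:

```python
from tmtoolkit.topicmod.model_io import save_ldamodel_summary_to_excel

# export the document-topic and topic-word summaries to an Excel file;
# passing the relevance matrix means the top terms are ranked by relevance
save_ldamodel_summary_to_excel('output/tm_final_k180_eta0.7_tw_relevance.xlsx',
                               topic_word_rel, doc_topic_distrib,
                               doc_labels, vocab, dtm=dtm)
```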

I manually identified the following COVID-19 related topics (yellow background) as well as nonsensical topics (gray background) in output/tm_final_k180_eta0.7_tw_relevance.xlsx (see the top_topic_word_labels sheet):

Most topics can be interpreted very well (see the Excel file). With a little more than 10% nonsensical topics (mostly consisting of very common words), however, we could probably remove more common words in the preprocessing step and generate a model with a bit fewer topics. But I think for our purposes here the model is sufficient. It's also interesting to see how the model captures different perspectives of the pandemic in separate topics, e.g. topic_24 relating to wearing masks, topic_50 relating to relaxing the Corona countermeasures or topic_131 relating to vaccines against the virus.

We also see that the manual approach found a few topics that differ from the "keyword search" approach. The following topics were found additionally:

Let's have a look at the top 10 terms for these additionally found topics:

We can see that by manually investigating the top words per topic, we identified Corona-related topics which we didn't find with our set of keywords, because it's hard to create a comprehensive set of keywords.

The following topics, on the other hand, were matched by the keyword search but not manually identified as COVID-19 related:

Let's investigate the top 10 terms for these two topics in order to make sure that we correctly identified these topics as not Corona-related:

topic_93 was not selected as Corona-related since it seems nonsensical (it's actually in the list of nonsensical topics). topic_120 was not selected since the topic is mainly about the economy and "coronakrise", the only Corona-related term, appears only at rank 9. However, we should later include these topics in a sensitivity analysis.

We can now exclude nonsensical topics from our model:

This leaves us with 159 topics. Note that the topic indices of the previously identified COVID-19 related topics no longer match the new document-topic and topic-word distributions. We can update the indices by using new_topic_mapping as returned from exclude_topics():
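A sketch of both steps with exclude_topics(). The variable holding the nonsensical topic indices and the exact return values are assumptions here:

```python
from tmtoolkit.topicmod.model_stats import exclude_topics

# `nonsense_topics` is assumed to hold the indices of the manually
# identified nonsensical topics from the Excel file
doc_topic_distrib, topic_word_distrib, new_topic_mapping = exclude_topics(
    nonsense_topics, doc_topic_distrib, topic_word_distrib,
    return_new_topic_mapping=True)

# map the previously identified COVID-19 topic indices to the new indices
covid_topics = np.array([new_topic_mapping[t] for t in covid_topics])

print(topic_word_distrib.shape[0])   # 159 topics remain
```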

We can also generate labels for the topics from their top-ranking words. This helps when referring to specific topics.
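tmtoolkit offers a helper for this; a sketch (parameter names may differ between versions):

```python
from tmtoolkit.topicmod.model_stats import generate_topic_labels_from_top_words

# label each topic by its index and top-ranking words, e.g. "50_lockerungen_..."
topic_labels = generate_topic_labels_from_top_words(
    topic_word_distrib, doc_topic_distrib, doc_lens, vocab, lambda_=0.6)

print(topic_labels[:5])
```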

These are the topic labels for our COVID-19 related topics:

A quick look at the marginal topic distribution shows us the most prominent topics in our corpus:

For the whole corpus (which includes about as many news articles from before the pandemic as from during it), we can calculate the overall share of Corona-related topics:
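Both quantities can be derived from the marginal topic distribution, which weights each document's topic proportions by its length. A sketch using tmtoolkit's marginal_topic_distrib():

```python
from tmtoolkit.topicmod.model_stats import marginal_topic_distrib

# marginal topic distribution: document-topic proportions weighted by
# document lengths, aggregated over the whole corpus
marginal_topics = marginal_topic_distrib(doc_topic_distrib, doc_lens)

# most prominent topics in the corpus
for i in np.argsort(marginal_topics)[::-1][:10]:
    print(topic_labels[i], round(float(marginal_topics[i]), 4))

# overall share of the COVID-19 related topics
print(marginal_topics[covid_topics].sum())
```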

We now focus on how the share of COVID-19 related topics developed over time and how it relates to the number of daily COVID-19 cases. This share represents our measure of COVID-19 media coverage on SPON.

First, we load the corpus metadata that was generated in prepare.py:

We retrieve the publication date for each news article (in the same order as doc_labels):

Now we calculate the marginal distribution of COVID-19 related topics per news article:

Let's put this all together with the document labels and document lengths in a dataframe:

Sorting by marginal distribution gives us the documents with highest share of Corona-related topics:
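A sketch of these steps; doc_dates is assumed to hold the publication date of each article (taken from the corpus metadata), in the same order as doc_labels:

```python
import pandas as pd

# probability mass of the COVID-19 related topics per article: sum of the
# document-topic proportions over the COVID-19 topic columns
covid_share_per_doc = doc_topic_distrib[:, covid_topics].sum(axis=1)

docs_df = pd.DataFrame({
    'doc_label': doc_labels,
    'date': doc_dates,               # assumed: publication date per article
    'doc_length': doc_lens,
    'covid_topic_share': covid_share_per_doc,
})

# documents with the highest share of Corona-related topics
docs_df.sort_values('covid_topic_share', ascending=False).head(10)
```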

We want to calculate an estimate of the daily share of Corona-related topics. Since we have several news articles of different lengths for each day, we can compute this estimate as the weighted average of the marginal COVID-19 topic probability of the articles published on that day. The weights are the share of an article's length relative to the total length of all articles on that day.

Let's first compute the weights:

We check that all weights sum up to 1 per day:

We now compute the daily share of Corona-related topics:

Check that we're within the limits of a valid probability:
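Putting the last few steps together, a sketch of the daily aggregation could look like this:

```python
# weight of each article: its length relative to the total length of all
# articles published on the same day
docs_df['weight'] = (docs_df['doc_length']
                     / docs_df.groupby('date')['doc_length'].transform('sum'))

# sanity check: the weights sum to 1 within each day
assert np.allclose(docs_df.groupby('date')['weight'].sum(), 1.0)

# daily share of COVID-19 related topics as a length-weighted average
daily_covid_share = (docs_df.assign(w=docs_df['weight'] * docs_df['covid_topic_share'])
                            .groupby('date')['w'].sum())

# the result is a probability, so it must lie in [0, 1]
assert daily_covid_share.between(0, 1).all()
```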

Finally, let's plot the share of COVID-19 topics over time:

We see clearly how media coverage on SPON saw a first uptick in February, when the first reports about the new lung disease in Wuhan arrived (this is what it was called in the first SPON articles). In March and April, media coverage skyrocketed to a daily topic share of almost 40%. Note, however, that we don't have information on how the respective articles were positioned on the website (i.e. frontpage news or rather placed at the bottom of the page), so this doesn't necessarily reflect the media coverage actually perceived by the public – it only reflects what was published anywhere on the page on that date.

Furthermore, the LDA approach that we used for topic modeling views documents as mixtures of topics. This means all documents, including those that cover COVID-19 in some way, also cover other topics to some degree. For example, an article may cover the pandemic and its impact on the economy and hence contain a mixture of, say, 60% COVID-19 related topics, 30% economy related topics and 10% other topics. So you shouldn't interpret the topic share displayed in the figure above as "x% of all articles covered COVID-19 on a given day". You should rather think of it like this: in the topic mixture of all articles on a given day taken together, x% was related to COVID-19.

It is hence no surprise that the peak of the COVID-19 related topic share is "only" at around 40%, even though you may have had a different impression from the news at that time: First, our sample includes all articles, including the dozens of articles on soccer results or "lifestyle" at the bottom of the page. Second, only very few articles solely cover COVID-19. Almost all of them are a mixture of several topics.

In order to see how this relates to the number of daily COVID-19 cases, we first load data that I fetched from the COVID-19 Data Hub (Guidotti, E., Ardia, D. (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376). The dataset is not part of the repository. You can download it at https://covid19datahub.io/articles/data.html#vintage-data.

We create a subset for our observation period and the variables of interest:

The variable confirmed contains the accumulated number of confirmed COVID-19 cases per day. We need the daily case counts, so we apply diff() to that column per country:

First, let's investigate Germany and create a subset for that:

Let's repeat this for a dataset that represents worldwide daily COVID-19 cases:
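A sketch of these preparation steps with pandas. The file name, the date range and the column names (following the COVID-19 Data Hub conventions, e.g. id holding ISO country codes) are assumptions:

```python
# load the COVID-19 Data Hub vintage data (hypothetical file name)
covid_df = pd.read_csv('data/covid19datahub.csv', parse_dates=['date'])

# subset for the observation period and the variables of interest
covid_df = covid_df.loc[(covid_df['date'] >= '2020-01-01') &
                        (covid_df['date'] <= '2020-08-31'),
                        ['id', 'date', 'confirmed']]

# `confirmed` is cumulative, so take per-country differences for daily cases
covid_df = covid_df.sort_values(['id', 'date'])
covid_df['daily_cases'] = covid_df.groupby('id')['confirmed'].diff()

# subset for Germany (ISO code "DEU")
germany = covid_df[covid_df['id'] == 'DEU']

# worldwide daily cases: sum over all countries per day
worldwide = covid_df.groupby('date', as_index=False)['daily_cases'].sum()
```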

We will now investigate how the share of COVID-19 topics relates to the number of daily COVID-19 cases in Germany.

We can see that for daily case numbers of up to ~1,000, higher case numbers are associated with a sharp increase in media coverage on SPON in our time frame. After that, the curve flattens until it settles at a high level.

Note: the LOWESS (aka LOESS) smoother in Python's statsmodels package doesn't support confidence intervals yet.

We could try to fit a function to this data to get more insights about the relationship between the number of cases in Germany and the share of COVID-19 related topics. We could fit a simple linear model with a single term, i.e. covid19topics ~ dailycases, but this wouldn't account for the damping effect that sets in at higher numbers of cases. Adding a quadratic term, i.e. specifying the model as covid19topics ~ dailycases + dailycases², accounts for the damping effect while still providing a simple, interpretable model.

Let's specify such a model:
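A sketch of such a model with statsmodels' formula API, reusing daily_covid_share and the germany data frame from the sketches above:

```python
import statsmodels.formula.api as smf

# join the daily topic share with the German daily case numbers by date
model_df = (pd.DataFrame({'covid19topics': daily_covid_share})
              .join(germany.set_index('date')['daily_cases']
                           .rename('dailycases'), how='inner')
              .dropna())

# OLS with a linear and a quadratic term for the daily case numbers
quad_fit = smf.ols('covid19topics ~ dailycases + I(dailycases ** 2)',
                   data=model_df).fit()
print(quad_fit.summary())
```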

We can plot the model fit and interpret the results below.

Starting with a share of Corona-related topics of around $0.09$ for no reported cases in Germany, the linear term can be interpreted as an increase of $0.1$ in Corona-related topic share per 1,000 cases. The damping effect of the squared term with the negative coefficient sets in at higher numbers of daily cases. For example, at 1,000 cases the damping effect is only around $-1.17 \cdot 10^{-8} \cdot 1000^2 = -0.0117$, but at 4,000 cases it is already $-1.17 \cdot 10^{-8} \cdot 4000^2 = -0.19$.

Let's have a look at the relationship between worldwide daily Corona cases and the share of COVID-19 related topics. We can create a figure similar to the one for the cases in Germany:

We can see that the relationship between the worldwide infection rate and the share of COVID-19 related topics on SPON is not as clear as for the data from Germany. There's a sharp increase in the share of COVID-19 related topics on SPON until around 40,000 cases, which coincides with the sharp increase of Corona cases in Germany in spring 2020. Despite increasing daily infections worldwide, the share of COVID-19 related topics first decreases and then stays roughly constant at around $15\%$ for 130,000 daily cases and more.

Conclusion

This notebook is supposed to showcase how topic models can be used to identify topics of interest in large-scale text data and how certain properties of these topics like their marginal probability can be used for further analyses.

In the case of SPON, we found that national daily infection numbers clearly drove the volume of media coverage on COVID-19 during our time frame (January 2020 to end of August 2020), which is probably not very surprising. Even though infection rates increased dramatically around the world in summer 2020 (e.g. in Brazil, India and the USA), media coverage first decreased and then stayed at a moderate level. Regarding the national situation, too, media coverage on SPON stagnated at a high level as infection rates in Germany increased. It will be interesting to see whether this pattern holds for the second "Corona wave" that hit Germany in late autumn 2020, or whether a certain "weariness" leads to decreasing coverage despite rising infection rates.