Scraping presidential transcripts
To begin, we must scrape the content of all presidential speeches recorded in American history. To do that, I’ll rely on the very handy BeautifulSoup library, and eventually store all data in a pandas dataframe that will be persisted in a pickle file.
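As a minimal sketch of this step (the page markup and the `speeches.pkl` filename here are invented for illustration; the real site's structure differs), the scraping loop boils down to parsing each transcript page and appending a row to a DataFrame:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Assumed page structure: each transcript page wraps the speech text in a
# <div class="transcript"> block -- the real markup will differ.
html = """
<html><body>
  <h1>July 4, 1861: Message to Congress</h1>
  <div class="transcript"><p>Fellow-citizens of the Senate...</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text(strip=True)
text = soup.find("div", class_="transcript").get_text(" ", strip=True)

# One row per speech; in the real run this accumulates over every page
df = pd.DataFrame([{"title": title, "speech": text}])
df.to_pickle("speeches.pkl")  # persist for the analysis steps below
```

In practice the same parse runs inside a loop over the list of transcript URLs, with `requests` fetching each page.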
Topic modeling and visualization
Now that the raw text of all presidential speeches in American history has been retrieved, we can proceed to light preprocessing before applying Latent Dirichlet Allocation.
In the following 5 cells, we tokenize each document (i.e. presidential speech), remove stopwords, compute the frequency of each token, and filter out all tokens that appear fewer than 10 times in the entire corpus of presidential speeches. Note that I used an ad-hoc threshold of 10; this should really be a parameter to experiment with. Also, the amount of processing applied to each document is intentionally simplistic. Finally, we set up the gensim-specific objects: a dictionary mapping words to integer ids, and a corpus that counts the number of occurrences of each distinct word, converts each word to its integer id, and returns the result as a sparse vector.
Finding the optimum number of topics
Now that the data is ready, we can run a batch LDA (feasible given the small size of the dataset we are working with) to discover the main topics in our documents.
0 --- 0.025*states + 0.017*united + 0.017*shall + 0.014*state + 0.010*constitution + 0.009*president + 0.009*act + 0.009*congress + 0.008*laws + 0.007*law
1 --- 0.026*government + 0.007*states + 0.007*chilean + 0.007*men + 0.006*sailors + 0.006*united + 0.005*mr + 0.005*german + 0.005*police + 0.004*vessels
2 --- 0.012*world + 0.011*peace + 0.009*people + 0.008*america + 0.008*freedom + 0.007*soviet + 0.006*united + 0.006*new + 0.005*states + 0.005*nations
3 --- 0.050*president + 0.030*mr + 0.024*think + 0.008*secretary + 0.008*general + 0.008*people + 0.007*time + 0.007*viet + 0.007*going + 0.007*nam
4 --- 0.011*government + 0.007*people + 0.007*business + 0.006*country + 0.005*economic + 0.005*congress + 0.005*world + 0.005*federal + 0.005*tax + 0.005*public
5 --- 0.006*people + 0.004*government + 0.004*united + 0.004*states + 0.003*country + 0.003*public + 0.003*congress + 0.003*question + 0.003*going + 0.002*time
6 --- 0.013*states + 0.012*government + 0.009*united + 0.008*congress + 0.007*public + 0.005*country + 0.005*great + 0.005*year + 0.004*general + 0.004*people
7 --- 0.011*peace + 0.010*vietnam + 0.010*people + 0.009*war + 0.009*world + 0.008*united + 0.007*south + 0.007*american + 0.007*nations + 0.006*states
8 --- 0.009*world + 0.009*congress + 0.008*new + 0.008*year + 0.007*america + 0.006*people + 0.006*energy + 0.006*american + 0.006*nation + 0.005*government
9 --- 0.014*government + 0.012*people + 0.007*states + 0.007*union + 0.007*constitution + 0.006*great + 0.006*shall + 0.006*men + 0.006*country + 0.005*free
The display of inferred topics shown above does not lend itself very well to interpretation. Aside from the fact that you have to read through all the topics, most people will interpret the main themes of each topic differently. This hits right at the core of my mixed feelings towards topic modeling. The ability to infer topics from a large set of documents is truly amazing, but I have always personally felt (and maybe that is just me) that the ensuing display of information was lacking. Indeed, I have found that the output of typical topic modeling techniques does not lend itself very well to visualization and, in the case of presentations to the uninitiated, interpretation. However, I recently came across the LDAvis R library developed by Kenny Shirley and Carson Sievert, which, to paraphrase their words, is a D3.js interactive visualization designed to help you interpret the topics in a topic model fit to a corpus of text using LDA.
Here, we use the great Python port of the LDAvis R library, available on GitHub at https://github.com/bmabey/pyLDAvis. Two attractive features of pyLDAvis are its ability to help interpret the topics extracted from a fitted LDA model, and the fact that it can easily be incorporated into an IPython notebook in nothing more than two lines of code!
Tracking and visualizing topic propensity over time
Now that we have shown how results gathered from topic modeling methods such as LDA can be visualized in an intuitive way, we can move on to additional data analysis. In particular, it would be interesting to uncover the temporal variation of topics across American history. I would personally be very curious to find out whether topic modeling can reverse-engineer the major events in American history. In the next step, we produce a dataframe where each row represents a speech and each of the 20 columns represents a topic. Each cell in the dataframe holds the probability that a given topic was assigned to a speech.
[(7, 0.10493554997876908), (10, 0.011621459517891617), (15, 0.86674743636700446)]
topic_0 topic_1 topic_2 topic_3 topic_4 \
lincoln|July 4, 1861 0.000011 0.000011 0.000011 0.011112 0.000011
buchanan|February 24, 1859 0.000033 0.000033 0.000033 0.203170 0.000033
reagan|November 11, 1988 0.000526 0.197846 0.000526 0.000526 0.000526
tyler|February 20, 1845 0.000114 0.000114 0.000114 0.000114 0.000114
eisenhower|January 17, 1961 0.000063 0.000063 0.000063 0.000063 0.000063
topic_5 topic_6 topic_7 topic_8 topic_9 \
lincoln|July 4, 1861 0.008504 0.000011 0.064621 0.051711 0.752269
buchanan|February 24, 1859 0.000033 0.000033 0.002136 0.249423 0.032370
reagan|November 11, 1988 0.453219 0.000526 0.000526 0.000526 0.000526
tyler|February 20, 1845 0.000114 0.000114 0.014633 0.361549 0.128588
eisenhower|January 17, 1961 0.615735 0.000063 0.000063 0.000063 0.000063
... topic_12 topic_13 topic_14 topic_15 \
lincoln|July 4, 1861 ... 0.000011 0.000011 0.000011 0.111633
buchanan|February 24, 1859 ... 0.000033 0.000033 0.006279 0.490955
reagan|November 11, 1988 ... 0.101381 0.239133 0.000526 0.000526
tyler|February 20, 1845 ... 0.000114 0.000114 0.000114 0.493404
eisenhower|January 17, 1961 ... 0.000063 0.074416 0.000063 0.000063
topic_16 topic_17 topic_18 topic_19 \
lincoln|July 4, 1861 0.000011 0.000011 0.000011 0.000011
buchanan|February 24, 1859 0.015235 0.000033 0.000033 0.000033
reagan|November 11, 1988 0.000526 0.000526 0.000526 0.000526
tyler|February 20, 1845 0.000114 0.000114 0.000114 0.000114
eisenhower|January 17, 1961 0.010800 0.000063 0.000063 0.000063
president year
lincoln|July 4, 1861 lincoln 1861
buchanan|February 24, 1859 buchanan 1859
reagan|November 11, 1988 reagan 1988
tyler|February 20, 1845 tyler 1845
eisenhower|January 17, 1961 eisenhower 1961
[5 rows x 22 columns]
Finally, we can compute the normalized frequency of topics by year and plot these as a time-series using the dygraphs library.
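The normalization itself is a short pandas operation; here is a sketch with toy rows (the name `topics_by_year` is mine):

```python
import pandas as pd

# Toy document-topic rows; the real input is the dataframe built above
df = pd.DataFrame({"topic_0": [0.8, 0.2, 0.5],
                   "topic_1": [0.2, 0.8, 0.5],
                   "year":    [1861, 1861, 1988]})

# Sum topic weights per year, then normalize each year's row to sum to 1
topics_by_year = df.groupby("year").sum()
topics_by_year = topics_by_year.div(topics_by_year.sum(axis=1), axis=0)
```

Each row of `topics_by_year` is then a per-year topic distribution, ready to be plotted as a time series.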
At this point, I’m going to do something that I am not very proud of and proceed to some nasty context switching. Although I played around with the charts library, I was not satisfied with the results and temporarily switched to R in order to leverage the dygraphs library. Thankfully, Jupyter notebooks have plenty of magic that makes it easy to call R from the notebook itself!
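Assuming rpy2 is installed, the two cells look roughly like this (`topics_by_year` is my illustrative name for the normalized dataframe, and the dygraph styling options are omitted):

```
%load_ext rpy2.ipython

%%R -i topics_by_year
# -i pushes the pandas dataframe into R; dygraph() then plots it
library(dygraphs)
dygraph(topics_by_year)
```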
Clustering individual presidential speeches
We can also wrangle the data a little bit more in order to visualize how the individual speeches cluster together. This time, we use the document-topic distributions and apply the t-SNE dimensionality reduction algorithm to map all speeches into two-dimensional space. Roughly, t-SNE is considered useful because it tends to preserve the overall topology of the data, so that neighboring (i.e. similar) speeches will hopefully be mapped to neighboring locations in two-dimensional space. Other well-known techniques such as k-means or MDS would likely be just as adequate for this exercise, but I’ve had good fortune with t-SNE, so am (perhaps unwisely) sticking to it here.
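A sketch of the t-SNE step (a random 100×20 stand-in replaces the real 880×20 document-topic matrix):

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for the 880x20 document-topic matrix built earlier;
# Dirichlet samples so each row looks like a topic distribution.
rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(np.ones(20), size=100)

# Map each speech to 2-D; perplexity must stay below the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=0, init="pca")
coords = tsne.fit_transform(doc_topics)  # shape: (n_speeches, 2)
```

The resulting two columns are the x/y coordinates shown in the output below.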
t-SNE: 13 sec
0 1
lincoln|July 4, 1861 12.246211 -4.594903
buchanan|February 24, 1859 13.982249 -1.675186
reagan|November 11, 1988 -7.665759 4.714818
tyler|February 20, 1845 11.953091 1.884652
eisenhower|January 17, 1961 -13.193183 -3.790267
We can now leverage the mpld3 library to display the t-SNE clusters inline. The interactive figure below shows the two-dimensional t-SNE coordinates of all 880 presidential speeches in American history. One of the challenges here was to generate distinct colors to map the different presidents, and I don’t think I did a particularly good job of it (the figure could probably also benefit from a legend, but I opted to spend my time on adding tooltip functionality instead!)
Clustering presidents
Finally, we can look at how the presidents cluster with one another based on the entire corpus of their speeches. Since we are clustering based on their speeches, it would be reasonable to expect that presidents with similar ideologies, or who were confronted with similar historical situations such as internal or external conflicts, economic depression, or periods of significant political instability, should cluster close to one another. For this exercise, we use the CountVectorizer class from the scikit-learn library to generate president-specific corpora built from all their speeches. As a result, we obtain a term-frequency matrix where each row is a president, each column is a word, and each cell holds the total number of times a president used a given word. Note that I mostly relied on default parameters for the CountVectorizer, but we could potentially tweak many parameters, including the stopword list, minimum/maximum document frequency, maximum number of words, n-gram generation and more…
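A sketch of this step with stand-in text (default `CountVectorizer` parameters, as in the text; t-SNE's `perplexity` is lowered because there are only five toy rows):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# One concatenated document per president (stand-in text; the real run
# joins every speech a president ever gave into one string)
president_docs = {
    "madison": "war congress britain treaty congress",
    "taft": "tariff commerce court commerce law",
    "clinton": "economy jobs health care economy",
    "carter": "energy inflation peace energy human rights",
    "buchanan": "union states slavery constitution states",
}

# Default parameters: rows = presidents, columns = words, cells = counts
vec = CountVectorizer()
tf = vec.fit_transform(president_docs.values())

# Project the term-frequency rows to 2-D for the scatter plot below
coords = TSNE(n_components=2, perplexity=2, random_state=0,
              init="random").fit_transform(tf.toarray())
```

The two t-SNE columns, paired with the president names, give exactly the three-column table shown below.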
0 1 president
0 -504.216414 -139.964997 madison
1 129.954414 639.936735 taft
2 1452.238353 -825.087538 clinton
3 735.346376 -492.296639 carter
4 -147.329302 433.361482 buchanan