I've recently rediscovered my affection for xkcd [1], and what better way to show it than to perform a data analysis on the comic's archives. In this post, we use Latent Dirichlet Allocation (LDA) to mine for topics from xkcd strips, and see if the comic lives up to its tagline of "A webcomic of romance, sarcasm, math, and language".
The first thing to realize is that this problem is intrinsically different from classifying documents into topics, because the topics are not known beforehand. (This is also, in a way, what the 'latent' in 'Latent Dirichlet Allocation' means.) We want to simultaneously solve two problems: discovering topic groups in our data, and assigning documents to those topics. (The assignment metaphor isn't exact, and we'll see why in just a sec.)
A conventional approach to grouping documents into topics might be to cluster them using some features and call each cluster one topic. LDA goes a step further, in that it allows for the possibility that a document arises from a combination of topics. So, for example, comic 162 might be classified as 0.6 physics, 0.4 romance.
The Data and the Features
Processing and interpreting the contents of the images would be a formidable task, if not an impossible one. So for now, we're going to stick to the text of the comics. This is good not only because text is easier to parse, but also because it probably contains the bulk of the information. Accordingly, I scraped the transcriptions for the xkcd comics - an easy enough task from the command line. (Yes, they are crowd-transcribed! You can find a load of webcomics transcribed at OhnoRobot, but Randall Munroe has conveniently put them in the source of the xkcd comic pages themselves.)
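(If you'd rather do the scraping in Python than from the command line, here's a rough sketch of one way to go about it, using xkcd's JSON metadata endpoint; the endpoint layout, the "transcript" field, and the cutoff of 1300 comics are my assumptions for illustration, not details from my original run.)

```python
import time
import requests

def fetch_transcript(num):
    # xkcd serves per-comic metadata as JSON; older comics include a
    # crowd-sourced "transcript" field in the payload.
    resp = requests.get(f"https://xkcd.com/{num}/info.0.json", timeout=10)
    resp.raise_for_status()
    return resp.json().get("transcript", "")

transcripts = {}
for num in range(1, 1300):      # arbitrary cutoff for this sketch
    if num == 404:              # comic 404 intentionally doesn't exist
        continue
    try:
        transcripts[num] = fetch_transcript(num)
    except requests.HTTPError:
        continue
    time.sleep(0.5)             # be gentle with the server
```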
Cleaning up the text required a number of judgement calls, and I usually went with whatever was simple. I explain these in comments in the code - feel free to alter it and do this in a different way.
Finally, the transcripts are converted into a bag of words - exactly the kind of input LDA works with. The code is shared via GitHub.
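(As a minimal sketch of what that preparation can look like - assuming the transcripts from the previous snippet and NLTK for stopwords and stemming; my cleanup choices here won't necessarily match the ones in the shared code:)

```python
import re
from nltk.corpus import stopwords      # needs nltk.download("stopwords") once
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def tokenize(transcript):
    # Strip scene descriptions ([[...]]) and alt-text ({{...}}) so that only
    # the spoken/written words remain -- one possible judgement call.
    text = re.sub(r"\[\[.*?\]\]|\{\{.*?\}\}", " ", transcript)
    words = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords and very short tokens, then stem what's left.
    return [stemmer.stem(w) for w in words if w not in stop and len(w) > 2]

docs = [tokenize(t) for t in transcripts.values() if t.strip()]
```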
What to Expect
I'm not going to cover the details of how LDA works (there is an easy-to-understand layman's explanation here, and a rigorous, technical one here), but I'll tell you what output we're expecting: LDA is a generative modeling technique, and it is going to give us k topics, where each 'topic' is basically a probability distribution over the set of all words in our vocabulary (all words ever seen in the input data). The values indicate the probability of each word being selected if you were trying to generate a random document from the given topic.
Each topic can then be interpreted from the words that are assigned the highest probabilities.
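To make that concrete, here is a sketch of fitting the model and inspecting those distributions with gensim - my choice of library for this illustration, not necessarily what the shared code uses, and the filtering thresholds are arbitrary:

```python
from gensim import corpora
from gensim.models import LdaModel

# Build the vocabulary and bag-of-words corpus from the tokenized transcripts.
dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare / very common stems
corpus = [dictionary.doc2bow(tokens) for tokens in docs]

# Fit LDA with k topics. Each row of get_topics() is one topic: a probability
# distribution over the entire vocabulary, summing to 1.
k = 4
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
               passes=20, random_state=0)
topic_word = lda.get_topics()                    # shape: (k, vocabulary size)
print(topic_word.shape, topic_word.sum(axis=1))  # each row sums to ~1.0
```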
The Results
I decided to go for four topics, since that's how many Randall uses to describe xkcd (romance, sarcasm, math, language). Here are the top 10 words from each topic that LDA came up with:
(Some words are stemmed[2], but the word roots are easily interpretable.)
Topic 1: click, found, type, googl, link, internet, star, map, check, twitter. This is clearly about the internet.
Topic 2: read, comic, line, time, panel, tri, label, busi, date, look. This one's a little fuzzy, but I think it's fair to call it meta discussions.
Topic 3: yeah, hey, peopl, world, love, sorri, time, stop, run, stuff. No clear topic here. I'm just going to label this small talk.
Topic 4: blam, art, ghost, spider, observ, aww, kingdom, employe, escap, hitler. A very interesting group - let's call this whimsical phantasmagoria.
I arbitrarily took the top 10 words from each topic, but we could wonder how many words actually 'make' the topic[3]. This plot graphs the probability associated with the top 100 words for each topic, sorted from most to least likely.
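(That plot is easy to reproduce from the fitted model; a sketch, again assuming the gensim objects from earlier rather than the original code:)

```python
import matplotlib.pyplot as plt

# Probability of the 100 most likely words in each topic, sorted descending --
# a reconstruction of the plot described above, not the original plotting code.
for t in range(k):
    probs = [p for _, p in lda.show_topic(t, topn=100)]
    plt.plot(range(1, 101), probs, label=f"Topic {t + 1}")
plt.xlabel("Word rank within topic")
plt.ylabel("Probability")
plt.legend()
plt.show()
```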
And individual comics can be visualized as the combination of topic fractions they are composed of. A few comics (each horizontal bar is one comic; the lengths of the coloured subsegments denote the probability of the comic belonging to each topic):
As expected, most comics draw largely from one or two topics, but a few cases are difficult to assign and go all over the place.
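(Those per-comic fractions come straight out of the fitted model; a minimal sketch, assuming the same gensim objects as before - the printed numbers in the comment are made up for illustration:)

```python
# Topic mixture for one comic: a list of (topic id, proportion) pairs summing
# to ~1 -- these are the coloured subsegments in the bars above.
doc_id = 0   # index into `corpus`, not the xkcd comic number
mixture = lda.get_document_topics(corpus[doc_id], minimum_probability=0.0)
print(mixture)   # e.g. [(0, 0.61), (1, 0.05), (2, 0.30), (3, 0.04)]
```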
So, what does this mean?
Well- I-
I actually don't have anything to say, so I'm just going to leave this here:
Notes
[1] There was some love lost in the middle, but the rediscovery started mostly after this comic.
[2] Stemming basically means reducing "love", "lovers", "loving", and other such lexical variants to the common root "lov", because they are all essentially talking about the same concept.
[3] As I pointed out earlier in this post, each topic assigns a probability to every word, but we can think of the words with dominant probabilities as what define the topic.