Identifying proper noun categories using machine learning
JONATHAN COULTON: Well, if you have listened to Ask Me Another before, then this game will probably sound familiar to you. It is one of my favorites. It's called This, That, or the Other. We will name an item and all you have to do is tell us which of three categories that item belongs to. Today's categories are grains, world currencies or Pokémon characters.
That was an excerpt from the transcript of an old episode of NPR's whimsical trivia/puzzle radio show Ask Me Another. Other editions of the show have asked participants to tell the difference between:
- A tech company, a car model, or a Star Wars location;
- A Harry Potter spell, a prescription drug, or a piece of IKEA furniture;
- A type of cheese, a dance move, or a Moby Dick character;
- and, my personal favourite: a type of pasta, the title of an opera, or a character on The Sopranos.
If you had to guess, based on what you know about what each category sounds like, would you be right more than a third of the time? Better yet, could we train a model to pick up on features that are particular to each category, and then get that model to perform better than chance on unseen examples?
Okay, so the data collection part isn't very difficult for a number of those categories, thanks to Wikipedia maintaining parseable articles in list form, like List of Pokémon[1]. I managed to easily procure lists of:
- Currencies
- Pokémon
- Pastas
- Cheese
- Locations in the Star Wars Universe
For features, I used the individual characters and the character pairs occurring in each word. For example, the word Naboo yields the features:
n, a, b, o, ^n, na, ab, bo, oo, o$
The ^ and the $ are special symbols I used to indicate the beginning and the end of a word, respectively.
If you've worked in the field of Natural Language Processing, you'll recognize these features as analogues of unigrams and bigrams in language models.
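If you're curious what computing these might look like, here's a minimal sketch (the function name and details are mine, not necessarily what's in the repo):

```python
def word_features(word):
    """Turn a word into a dict of character unigram and bigram features.

    ^ and $ mark the beginning and end of the word, so 'o$' means
    "the word ends in o".
    """
    word = word.lower()
    padded = '^' + word + '$'
    features = {}
    for ch in word:                    # unigrams: which letters occur
        features[ch] = True
    for i in range(len(padded) - 1):   # bigrams, including boundary pairs
        features[padded[i:i + 2]] = True
    return features

print(sorted(word_features('Naboo')))
# ['^n', 'a', 'ab', 'b', 'bo', 'n', 'na', 'o', 'o$', 'oo']
```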
Next, I trained a Naïve Bayes model in Python, using the excellent NLTK library. I picked categories two at a time, but Naïve Bayes extends to any number of targets very naturally. Feel free to play around with the code; you can find it on GitHub here.
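For the curious, training such a classifier with NLTK might look roughly like this (the file names here are hypothetical placeholders, and word_features is the sketch above; the actual code in the repo may differ):

```python
import random
import nltk

def load_items(path, label):
    """Read one item per line from a text file and pair it with a label."""
    with open(path) as f:
        return [(line.strip(), label) for line in f if line.strip()]

# Hypothetical file names -- one item per line, as described further down.
data = load_items('pastas.txt', 'pastas') + load_items('starWars.txt', 'starWars')
random.shuffle(data)

# Featurize every item, then hold out 20% of it for testing.
featuresets = [(word_features(w), label) for (w, label) in data]
cutoff = int(0.8 * len(featuresets))
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)
```

That last call, show_most_informative_features, is what produces tables like the one you'll see further down.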
With just those simple kinds of features, I was able to get upwards of 80% accuracy[2] on most pairs. In fact, on a pair like Cheese vs Pasta (I would totally watch a movie with that title. Oi! In some circles that could pass for humour!), which seems like a difficult pair to classify, I could get as much as 92% accuracy.
Here's a twist on the problem statement: what if you were designing the game, and wanted to pick the hardest items to guess? We can directly extend the results of the earlier part to get these: we simply need the items that the algorithm misclassified (a sketch of how to collect them follows the quiz below). So here's a test to see how you do on the toughies. In the following set, can you guess whether each item is a pasta or a location in the Star Wars universe?
1. Bestine
2. Quelli
3. Falleen
4. Alfabeto
5. Felucia
6. Sorprese
7. Sulorine
8. Egg barley
Answers:
Star Wars locations: 1, 2, 3, 5, 7. Pastas: 4, 6, 8 (Yeah, the last one was a giveaway[3])
Did you do well? Pat yourself on the back, for today you have outwitted a machine.
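As promised, here's roughly how those hard items could be harvested, assuming the classifier and the train/test split from the earlier sketch:

```python
# The items the classifier gets wrong are precisely the ones whose spelling
# "looks like" the other category -- ideal candidates for the game.
# Assumes data, cutoff, classifier and word_features from the sketches above.
hard_items = [word for (word, label) in data[cutoff:]
              if classifier.classify(word_features(word)) != label]
print(hard_items)
```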
So why does the model classify these incorrectly? This might provide some insight: here are the top 10 features the model picked[4]:
| Feature/Value | Dominant cat : Lesser cat | Ratio of Occurrence |
|---------------|---------------------------|---------------------|
| li = True     | pastas : starWars         | 23.2 : 1.0          |
| ti = True     | pastas : starWars         | 12.8 : 1.0          |
| et = True     | pastas : starWars         | 10.8 : 1.0          |
| i$ = True     | pastas : starWars         | 9.0 : 1.0           |
| ^p = True     | pastas : starWars         | 8.8 : 1.0           |
| tt = True     | pastas : starWars         | 8.1 : 1.0           |
| length = 5    | starWars : pastas         | 7.9 : 1.0           |
| ci = True     | pastas : starWars         | 6.9 : 1.0           |
| f = True      | pastas : starWars         | 6.3 : 1.0           |
| nn = True     | pastas : starWars         | 6.2 : 1.0           |
That's that for these models. Here are some suggestions for other cool things you could do with the code, if you have a teeny weeny bit of coding experience (no math/stats/machine learning experience required):
- Find out if your name looks like a grain or a kind of pasta, as sketched after this list (and then go around claiming "I just don't get grain-people. Pasta-ites FTW!")
- Gather more lists (Simpsons characters, varieties of chili, brands of cosmetics, ….) and see which ones look like which, er, other ones. (The data is represented simply as a text file, with one item on each line.)
- See which the cheesiest pastas are! (I'm so terribly sorry. I'll never ever try to be funny again. Ever.)
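For that first suggestion, the check is a one-liner once you have a trained classifier (here assuming one trained, as in the sketches above, on the two lists in question):

```python
# Does a given name look more like a grain or a pasta? Assumes a classifier
# trained as sketched earlier, on the corresponding two word lists.
print(classifier.classify(word_features('Jonathan')))
```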
Someone should totally turn this into an app, don't you think?
Don't leave me hanging here, guys.
Guys?
Fine. I'll just make the app.
Footnotes
[1] While on that topic, check out these weird lists on Wikipedia: List of helicopter prison escapes, List of dogs faithful after their masters' deaths, List of fictional swords, List of fictional Jews.
[2] I'm using accuracy as simply #correct predictions / (#correct predictions + #incorrect predictions).
[3] Only goes to show that there are more/better features to be derived here.
[4] I was working with a slightly different version of the feature set here, which included a variable for word length. Don't be thrown off by that.
Comments
(If you haven't already tried) Mentalfloss has hordes of these in the form of quizzes: http://mentalfloss.com/quizzes
These are way too addictive! I loved 'Celebrity Baby Name or Computer Virus?'. Unfortunately, I only scored 45%.