2014-10-26

# The Story

To give an example of working with data while still having fun, I’ve decided to tackle Transformers. Specifically, after finding this dataset someone scraped together from Wikipedia, I thought it’d be interesting to see if we can predict the affiliation (Autobot vs. Decepticon) of a character using one of Scikit-Learn’s naive Bayes classifiers.

Enough chit-chat; let’s roll up our sleeves and dive in!

# Preliminaries

Checking out the data, it looks like the entries are indexed by a key on the far left of each line, which approximates the Transformer’s name. Let’s see how many entries we’re dealing with:
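A sketch of the counting step, assuming (hypothetically) the scraped file is YAML-like text where each entry's name sits flush-left and ends in a colon; the inline `raw` string stands in for the real file:

```python
# Inline stand-in for the real scraped file (the actual data has 718 entries).
raw = """\
Optimus Prime:
  motto: Freedom is the right of all sentient beings.
Prowl:
  motto: Logic is the ultimate weapon.
Prowl:
  motto: Strive for perfection even if others may not.
"""

# Top-level entries start at column 0 and end with a colon.
names = [line[:-1] for line in raw.splitlines()
         if line and not line[0].isspace() and line.endswith(":")]
print(len(names))  # -> 3 for this sample
```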

There are 718 entries, but it looks like there may be repeats (i.e., either the data isn’t as clean as we’d hoped, or the names are not a unique enough identifier).

So it looks like the majority of names aren’t repeated, but there are a few; one is even repeated 8 times!
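Spotting the repeats is a one-liner with `collections.Counter`; the name list here is a small stand-in, not the real 718 entries:

```python
from collections import Counter

# Toy stand-in for the full list of scraped names.
names = ["Prowl", "Sideswipe", "Prowl", "Ratchet", "Prowl", "Bumblebee"]
counts = Counter(names)

# Keep only the names that appear more than once.
repeats = {name: n for name, n in counts.items() if n > 1}
print(counts.most_common(1))  # the most-repeated name first
```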

Prowl is the most popular, eh? Let’s look at the different entries with that name.

So these certainly seem to be unique entries; we should uniquify their names, because PyYAML turns the YAML into a dictionary and therefore collapses duplicate keys into a single entry.
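One way to uniquify is to append a counter to repeated names; the `(2)`, `(3)` suffix format here is my own choice, not necessarily what the original post used:

```python
from collections import Counter

# Toy stand-in for the scraped name list.
names = ["Optimus Prime", "Prowl", "Prowl", "Prowl"]

seen = Counter()
unique_names = []
for name in names:
    seen[name] += 1
    # First occurrence keeps its name; repeats get a numeric suffix.
    unique_names.append(name if seen[name] == 1 else f"{name} ({seen[name]})")
print(unique_names)
```

With the keys made unique, `yaml.safe_load` will no longer collapse entries, and `yaml.safe_dump` can write the cleaned data back out.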

Looks good; let’s save it to a new YAML file.

# Bring out the Python

Let’s start up IPython and start digging.

Loading that raw data into Pandas:
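A sketch of the load, assuming the parsed YAML is a dict of dicts keyed by name; the inline `data` dict stands in for what `yaml.safe_load` would return, and the field names are assumptions:

```python
import pandas as pd

# Stand-in for the dict produced by yaml.safe_load on the cleaned file.
data = {
    "Optimus Prime": {"affiliation": "Autobot",
                      "motto": "Freedom is the right of all sentient beings."},
    "Megatron": {"affiliation": "Decepticon",
                 "motto": "Peace through tyranny!"},
    "Nameless": {"affiliation": "Autobot"},  # entry with no motto
}

# orient="index" makes each name a row, each field a column.
df = pd.DataFrame.from_dict(data, orient="index")
print(df["motto"].notnull().sum())  # entries that actually have a motto
```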

There are 485 entries that have a motto, and every entry with a motto also has an affiliation. As you might expect, though, our data isn’t clean enough yet; there are some affiliations we don’t expect (i.e., not just “Autobot” or “Decepticon”), and the mottos aren’t in the cleanest form.

Consulting some expert knowledge, it looks like

• Maximals can count as Autobots since they’re descended from them
• Predacons can count as Decepticons, since they’re the enemies of the Maximals
• Mini-Cons can be either, so we’ll have to ignore them
• Vehicons are Decepticons
• I was unable to find botconcron; this looks like a transcription error, but there’s only one, so we’ll drop it

Cleaning up and creating a new variable numeric_affil with 0 = Decepticon and 1 = Autobot:
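A sketch of that cleanup, with the bullet points above turned into a mapping; the column names and toy affiliations are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"affiliation": ["Autobot", "Maximal", "Decepticon",
                                   "Predacon", "Mini-Con", "Vehicon"]})

# Fold the later-series factions into the two we care about.
folded = df["affiliation"].replace({"Maximal": "Autobot",
                                    "Predacon": "Decepticon",
                                    "Vehicon": "Decepticon"})

# Drop everything else (Mini-Cons, transcription errors, ...).
df = df[folded.isin(["Autobot", "Decepticon"])].copy()
df["numeric_affil"] = (folded[df.index] == "Autobot").astype(int)
print(df["numeric_affil"].tolist())
```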

Cleaning the mottos:
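A sketch of the motto cleanup; the exact rules (lowercase, strip punctuation, collapse whitespace) are my guess at what "cleaning" involved here:

```python
import re

def clean_motto(motto):
    motto = motto.lower()
    motto = re.sub(r"[^a-z\s]", " ", motto)    # drop punctuation and digits
    return re.sub(r"\s+", " ", motto).strip()  # collapse runs of whitespace

print(clean_motto('"Freedom is the right of ALL sentient beings!"'))
```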

This still isn’t perfect, but it’s much better than we started with.

Now, we’ve got our cleaned data:

# Scikit-Learn

Let’s import what we’ll need: two vectorizers for transforming text into matrices, and the two naive Bayes classifiers we’ll compare. We’ll also grab the word_tokenize function from NLTK.

To compare the Bernoulli and the Gaussian naive Bayes classifiers, we’ll need to apply our vectorizers to both. To play fairly, we’ll break our data into train and test groups.

Now to build the two models we wish to compare.
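A sketch of the split-and-fit pipeline, assuming scikit-learn's `CountVectorizer` (`binary=True` for the Bernoulli model, dense counts for the Gaussian one). For simplicity this uses the vectorizer's default tokenizer rather than NLTK's `word_tokenize`, and the mottos and labels are toy stand-ins for the cleaned data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB

mottos = ["freedom is the right of all sentient beings",
          "peace through tyranny",
          "strive for perfection",
          "conquest is made of the ashes of one's enemies",
          "logic is the ultimate weapon",
          "fear is the element of surprise"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = Autobot, 0 = Decepticon

X_train_text, X_test_text, y_train, y_test = train_test_split(
    mottos, labels, test_size=0.33, random_state=0, stratify=labels)

# Bernoulli NB sees binary word-presence features...
bin_vec = CountVectorizer(binary=True)
bernoulli = BernoulliNB().fit(bin_vec.fit_transform(X_train_text), y_train)

# ...while Gaussian NB needs a dense count matrix.
count_vec = CountVectorizer()
gaussian = GaussianNB().fit(
    count_vec.fit_transform(X_train_text).toarray(), y_train)
```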

# Comparing Models

If this were a real-life task, we might not have test data. To compare our models, then, we won’t resort to the test data just yet; instead, we’ll use leave-one-out cross-validation.
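A sketch of leave-one-out cross-validation for the two classifiers, using scikit-learn's `LeaveOneOut` and `cross_val_score` on the same kind of toy mottos as above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB

mottos = ["freedom is the right of all sentient beings",
          "peace through tyranny",
          "strive for perfection",
          "conquest is made of the ashes of one's enemies",
          "logic is the ultimate weapon",
          "fear is the element of surprise"]
labels = np.array([1, 0, 1, 0, 1, 0])

X_bin = CountVectorizer(binary=True).fit_transform(mottos)
X_cnt = CountVectorizer().fit_transform(mottos).toarray()

# Each fold holds out exactly one motto and trains on the rest.
loo = LeaveOneOut()
bern_acc = cross_val_score(BernoulliNB(), X_bin, labels, cv=loo).mean()
gauss_acc = cross_val_score(GaussianNB(), X_cnt, labels, cv=loo).mean()
print(bern_acc, gauss_acc)
```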

Though neither of these two classifiers looks great, the binary one wins on our training data. Let’s pretend we’ve made the decision to field that classifier in production and check out its performance on the test data.

# Binary Classifier Results

Here’s the simplest way of looking at our Binary NB’s performance: a confusion matrix. As we can see from this, it looks like our classifier was pretty good at getting Autobots right, but Decepticons were harder to accurately predict—I guess this makes sense, since Decepticons are rather sneaky.
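For reference, `confusion_matrix` puts true labels on the rows and predicted labels on the columns; the labels below are toy stand-ins (0 = Decepticon, 1 = Autobot, as earlier):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]  # one Decepticon misread as an Autobot

cm = confusion_matrix(y_true, y_pred)
print(cm)  # row 0: true Decepticons, row 1: true Autobots
```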

For a more in-depth measure of the classifier’s performance, we can turn to sklearn’s classification_report:
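`classification_report` breaks out per-class precision, recall, and F1; here it runs on the same toy labels as the confusion-matrix example:

```python
from sklearn.metrics import classification_report

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]

report = classification_report(y_true, y_pred,
                               target_names=["Decepticon", "Autobot"])
print(report)
```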

According to this, we can see that our classifier is best at identifying actual Autobots (see the Autobot recall score), but isn’t that great at much else. For more on how to interpret this table, see Scikit-Learn’s documentation.

# Conclusion

It looks like we didn’t do that well at classifying Transformer affiliation from motto; our best classifier only had an F1 score slightly above 50%. One possibility is that we didn’t have enough data; perhaps a classifier trained on thousands of samples would do better than ours, which had fewer than 400. Another possibility is that mottos aren’t that indicative of affiliation to begin with, and we should look for our predictive power elsewhere. In any case, it’s kind of fitting to use Machine Learning on Transformers.