To give an example of working with data while still having fun, I’ve decided to
tackle Transformers.
Specifically, after finding this
dataset someone scraped
together from Wikipedia, I thought it’d be interesting to see if we can
predict the affiliation (Autobot vs. Decepticon) of a character using one of
Scikit-Learn’s
naive Bayes
classifiers.
Enough chit-chat; let’s roll up our sleeves and dive in!
Preliminaries
Checking out the data, it looks like the entries are indexed by a key (on the
far left of each line) that approximates the Transformer’s name.
Let’s see how many entries we’re dealing with:
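A rough way to count them, assuming the scraped file is called `transformers.yml` (my placeholder name, not necessarily the original’s) and each entry’s name sits flush against the left margin:

```python
# Count the top-level entries: lines that aren't indented are entry names.
with open("transformers.yml") as f:
    names = [line.split(":")[0] for line in f
             if line.strip() and not line[0].isspace()]

print(len(names))
```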
There are 718 entries, but it looks like there may be repeats (i.e., either the
data isn’t as clean as we’d hoped, or the names are not a unique enough
identifier).
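A frequency count over those names (reusing the `names` list from the snippet above) tells us how bad the repetition is; something like:

```python
from collections import Counter

# How many times does each name appear?
name_counts = Counter(names)

# Distribution of repeat counts: how many names appear once, twice, and so on.
print(Counter(name_counts.values()))

# The most heavily repeated names.
print(name_counts.most_common(5))
```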
So it looks like the majority of names aren’t repeated, but there are a few;
one is even repeated 8 times!
Prowl is the most popular, eh? Let’s look at the different entries with that
name.
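Since the entries are YAML blocks, one quick-and-dirty way to eyeball them is to print every block whose top-level key is “Prowl”; a sketch:

```python
# Print the raw lines for every entry named "Prowl".
printing = False
with open("transformers.yml") as f:
    for line in f:
        if line.strip() and not line[0].isspace():
            # A new top-level entry starts here; only print it if it's Prowl.
            printing = line.startswith("Prowl")
        if printing:
            print(line, end="")
```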
So these certainly seem to be distinct characters; we should uniquify these
names before parsing, because PyYAML turns the YAML into a dictionary (and
would therefore squash repeated names under a single key).
Looks good; let’s save it to a new YAML file.
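Writing the de-duplicated text back out (again, the filename is just my placeholder):

```python
with open("transformers_unique.yml", "w") as f:
    f.writelines(unique_lines)
```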
Bring out the Python
Let’s start up IPython and start digging.
Loading that raw data into Pandas:
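Roughly like this, assuming the uniquified file from above and guessing that the fields we care about are called `motto` and `affiliation`:

```python
import yaml
import pandas as pd

with open("transformers_unique.yml") as f:
    raw = yaml.safe_load(f)

# Each top-level key becomes a row; each field becomes a column.
df = pd.DataFrame.from_dict(raw, orient="index")

# How many entries have a motto, and do they all have an affiliation?
has_motto = df[df["motto"].notnull()]
print(len(has_motto))
print(has_motto["affiliation"].notnull().all())
```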
There are 485 entries that have a motto, and every entry with a motto also has
an affiliation.
As you might expect, though, our data isn’t clean enough yet: there are some
affiliations besides just “Autobot” and “Decepticon”, and the mottos aren’t in
the cleanest form.
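A look at the distinct affiliation values (again, hedged on the column name) shows the stragglers we need to deal with:

```python
# Every distinct affiliation among the entries that have a motto.
print(has_motto["affiliation"].value_counts())
```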
Maximals can count as Autobots since they’re descended from them
Predacons can count as Decepticons, since they’re the enemies of the Maximals
Mini-Cons can be either, so we’ll have to ignore them
Vehicons are Decepticons
I was unable to find botconcron; this looks like a transcription error, but
there’s only one so we’ll drop it
Cleaning up and creating a new variable numeric_affil with 0 = Decepticon and
1 = Autobot:
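Here’s a sketch of that mapping, following the rules above; the exact spelling of the raw affiliation values is my assumption:

```python
# Map each raw affiliation to 1 (Autobot) or 0 (Decepticon);
# anything unmapped (Mini-Con, the lone botconcron) becomes NaN and is dropped.
affil_map = {
    "Autobot": 1,
    "Maximal": 1,      # descended from the Autobots
    "Decepticon": 0,
    "Predacon": 0,     # enemies of the Maximals
    "Vehicon": 0,
}

df = has_motto.copy()
df["numeric_affil"] = df["affiliation"].map(affil_map)
df = df[df["numeric_affil"].notnull()]
```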
Cleaning the mottos:
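Something like the following; exactly what counts as “clean” here is my guess (lowercasing and stripping punctuation), not necessarily what the original analysis did:

```python
import re

def clean_motto(motto):
    """Lowercase a motto and strip punctuation, digits, and extra whitespace."""
    motto = motto.lower()
    motto = re.sub(r"[^a-z\s]", " ", motto)    # drop punctuation and digits
    return re.sub(r"\s+", " ", motto).strip()  # collapse runs of whitespace

df["clean_motto"] = df["motto"].apply(clean_motto)
```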
This still isn’t perfect, but it’s much better than we started with.
Now, we’ve got our cleaned data:
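A quick peek at the columns we’ll actually use downstream:

```python
print(df[["affiliation", "numeric_affil", "clean_motto"]].head())
```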
Scikit-Learn
Let’s import what we’ll need: two vectorizers for transforming text into
matrices, and the two naive Bayes classifiers we’ll compare.
We’ll also grab the word_tokenize function from NLTK.
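My guess at the import list; the two vectorizers here (a binary CountVectorizer and a TfidfVectorizer) are an assumption, while BernoulliNB, GaussianNB, and word_tokenize are named above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from nltk import word_tokenize
```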
To compare the Bernoulli and Gaussian naive Bayes classifiers, we’ll need to
vectorize the mottos separately for each of them.
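One plausible setup: binary presence/absence features for the Bernoulli model and tf-idf features for the Gaussian one, both tokenized with NLTK:

```python
# Bernoulli NB wants binary features; Gaussian NB gets dense tf-idf weights.
binary_vec = CountVectorizer(tokenizer=word_tokenize, binary=True)
tfidf_vec = TfidfVectorizer(tokenizer=word_tokenize)
```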
To play fairly, we’ll break our data into train and test groups.
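For instance (the split fraction and random seed are arbitrary choices of mine):

```python
from sklearn.model_selection import train_test_split

motto_train, motto_test, y_train, y_test = train_test_split(
    df["clean_motto"], df["numeric_affil"],
    test_size=0.2, random_state=42)
```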
Now to build the two models we wish to compare.
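Fitting each vectorizer on the training mottos only, then training its classifier (GaussianNB needs dense arrays, hence the `toarray()` calls):

```python
X_train_binary = binary_vec.fit_transform(motto_train).toarray()
X_train_tfidf = tfidf_vec.fit_transform(motto_train).toarray()

bernoulli_model = BernoulliNB().fit(X_train_binary, y_train)
gaussian_model = GaussianNB().fit(X_train_tfidf, y_train)
```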
Comparing Models
If this were a real-life task, we might not have test data.
To compare our models, then, we won’t resort to the test data just yet;
instead, we’ll use leave-one-out cross-validation.
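A sketch of that comparison using scikit-learn’s `cross_val_score` with a `LeaveOneOut` splitter (the default scoring for classifiers is accuracy):

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score

loo = LeaveOneOut()

# Leave-one-out accuracy on the training set for each model.
bernoulli_scores = cross_val_score(BernoulliNB(), X_train_binary, y_train, cv=loo)
gaussian_scores = cross_val_score(GaussianNB(), X_train_tfidf, y_train, cv=loo)

print(bernoulli_scores.mean(), gaussian_scores.mean())
```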
Though neither of these two classifiers looks great, the binary (Bernoulli)
one wins on our training data.
Let’s pretend we’ve made the decision to field that classifier in production
and check out its performance on the test data.
Binary Classifier Results
Here’s the simplest way of looking at our Binary NB’s performance: a confusion
matrix.
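Transforming the held-out mottos with the already-fitted binary vectorizer and comparing predictions against the true labels:

```python
from sklearn.metrics import confusion_matrix

X_test_binary = binary_vec.transform(motto_test).toarray()
predictions = bernoulli_model.predict(X_test_binary)

print(confusion_matrix(y_test, predictions))
```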
As we can see from this, it looks like our classifier was pretty good at
getting Autobots right, but Decepticons were harder to accurately predict—I
guess this makes sense, since Decepticons are rather sneaky.
For a more in-depth measure of the classifier’s performance, we can turn to
sklearn’s classification_report:
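With 0 meaning Decepticon and 1 meaning Autobot, that looks like:

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions,
                            target_names=["Decepticon", "Autobot"]))
```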
According to this, we can see that our classifier is best at identifying actual
Autobots (see the Autobot recall score), but isn’t that great at much else.
For more on how to interpret this table, see Scikit-Learn’s
documentation.
Conclusion
It looks like we didn’t do that well at classifying Transformer affiliation
from motto; our best classifier only had an F1 score slightly above 50%.
One possibility is that we didn’t have enough data; perhaps a classifier
trained on thousands of samples would do better than ours, which had fewer
than 400.
Another possibility is that mottos aren’t that indicative of affiliation to
begin with, and we should look for our predictive power elsewhere.
In any case, it’s kind of fitting to use Machine Learning on Transformers.