The Story
To give an example of working with data while still having fun, I’ve decided to tackle Transformers. Specifically, after finding this dataset someone scraped together from Wikipedia, I thought it’d be interesting to see if we can predict the affiliation (Autobot vs. Decepticon) of a character using one of Scikit-Learn’s naive Bayes classifiers.
Enough chit-chat; let’s roll up our sleeves and dive in!
Preliminaries
Checking out the data, it looks like the entries are indexed by a key on the far left of each line, which approximates the Transformer’s name. Let’s see how many entries we’re dealing with:
$ grep -P '^\S' infobox_transformers_character.yaml | wc -l
718
There are 718 entries, but it looks like there may be repeats (i.e., either the data isn’t as clean as we’d hoped, or the names are not a unique enough identifier).
$ grep -P '^\S' infobox_transformers_character.yaml |\
uniq -c |\
awk '{print $1}' |\
sort -n |\
uniq -c
250 1
85 2
45 3
19 4
12 5
2 6
1 7
1 8
So it looks like the majority of names aren’t repeated, but there are a few; one is even repeated 8 times!
$ grep -P '^\S' infobox_transformers_character.yaml |\
uniq -c |\
sort -nr |\
head
8 Prowl_(Transformers):
7 Bonecrusher_(Transformers):
6 Ultra_Magnus:
6 Ironhide:
5 Thrust_(Transformers):
5 Swindle_(Transformers):
5 Snarl_(Transformers):
5 Skydive_(Transformers):
5 Scavenger_(Transformers):
5 Razorclaw:
Prowl is the most popular, eh? Let’s look at the different entries with that name.
$ grep -P -A5 '^Prowl' infobox_transformers_character.yaml
# output omitted
So these certainly seem to be unique entries; we should uniquify these names, because PyYAML turns the YAML into a dictionary, and a dictionary keeps only one value per key, so all but the last entry for a repeated name would be silently squashed.
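A quick illustration of that squashing, on a made-up two-entry document (not the real data):
>>> import yaml
>>> yaml.safe_load('Prowl: {series: G1}\nProwl: {series: Animated}')
{'Prowl': {'series': 'Animated'}}
Only the last Prowl survives. So let’s tack a running counter onto each top-level name: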
$ cat infobox_transformers_character.yaml |\
perl -ple 'BEGIN{$n = 0;} if (m/(^\S+)[:]/) { $_ = $1 . ($n++) . ":"; }'
# output omitted
Looks good; let’s save it to a new YAML file.
$ perl -ple 'BEGIN{$n = 0;} if (m/(^\S+)[:]/) { $_ = $1 . ($n++) . ":"; }' \
infobox_transformers_character.yaml > unique_names.yaml
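If Perl isn’t your thing, a rough Python equivalent of that one-liner would look something like this (just a sketch; the Perl version above is what actually produced unique_names.yaml):
>>> import re
>>> def uniquify(in_path, out_path):
        n = 0  # plays the role of the one-liner's $n
        with open(in_path) as src, open(out_path, 'w') as dst:
            for line in src:
                match = re.match(r'(\S+):', line)
                if match:  # a top-level key: rewrite it with the counter
                    line = '{0}{1}:\n'.format(match.group(1), n)
                    n += 1
                dst.write(line)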
Bring out the Python
Let’s fire up IPython and start digging.
>>> import yaml
>>> with open('unique_names.yaml') as f:
        raw_data = yaml.safe_load(f)  # recent PyYAML requires a Loader for plain load()
Loading that raw data into Pandas:
>>> import numpy as np
>>> import pandas as pd
>>> messy_df = pd.DataFrame(raw_data).transpose()
>>> has_motto = messy_df[messy_df.motto.notnull()].copy()  # .copy() so later column assignments don't hit a view
>>> len(has_motto)
485
>>> any(has_motto.affiliation.isnull())
False
There are 485 entries that have a motto, and every entry with a motto also has an affiliation. As you might expect, though, our data isn’t clean enough yet: there are some affiliations we don’t expect (i.e., not just “Autobot” or “Decepticon”), and the mottos aren’t in the cleanest form.
>>> has_motto.affiliation.unique()
array(['Decepticon', 'Autobot', 'Maximal', 'Predacon',
'Autobot, later Maximal', 'Mini-Con',
'Predacon, later Maximal then Decepticon',
'Maximal, later Autobot', 'Vehicon',
'Maximal, \\n Former Predacon', 'Mutant',
'Decepticon, later Autobot', 'None/Decepticon',
'Decepticon, later Predacon', 'botconcron'],
dtype=object)
>>> has_motto.affiliation.value_counts()
Autobot 213
Decepticon 143
Mini-Con 40
Maximal 28
Predacon 23
Vehicon 9
Maximal, later Autobot 7
Decepticon, later Autobot 6
Mutant 4
Autobot, later Maximal 4
Maximal, \n Former Predacon 3
None/Decepticon 2
Decepticon, later Predacon 1
botconcron 1
Predacon, later Maximal then Decepticon 1
Consulting some expert knowledge, it looks like:
- Maximals can count as Autobots since they’re descended from them
- Predacons can count as Decepticons, since they’re the enemies of the Maximals
- Mini-Cons can be either, so we’ll have to ignore them
- Vehicons are Decepticons
- I was unable to find botconcron anywhere; it looks like a transcription error, and since there’s only one, we’ll just drop it
Cleaning up and creating a new variable numeric_affil with 0 = Decepticon and 1 = Autobot:
>>> def normalize_aff(aff):
if any(bad in aff for bad in 'Decepticon Predacon Vehicon'.split()):
return 0
elif any(good in aff for good in 'Autobot Maximal'.split()):
return 1
else:
return np.nan
>>> has_motto['numeric_affil'] = has_motto.affiliation.apply(normalize_aff)
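As a sanity check, tallying the value_counts table above by hand gives 252 Autobots, 188 Decepticons, and 45 unclassifiable entries (my arithmetic; double-check me):
>>> has_motto.numeric_affil.value_counts(dropna=False)
1.0    252
0.0    188
NaN     45
dtype: int64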
Cleaning the mottos:
>>> def clean_motto(m):
        return (m.lower()
                .replace('\u2019 ', "'")
                .replace('\u2019', "'")
                .replace('\\"', '')
                .replace('<br>', '')
                .replace('</br>', ''))
>>> has_motto['clean_motto'] = has_motto.motto.apply(clean_motto)
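To see the cleaning in action on an invented motto (this one isn’t in the dataset):
>>> clean_motto('Freedom is the right of ALL sentient beings!<br>\u2019Tis true.')
"freedom is the right of all sentient beings!'tis true."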
This still isn’t perfect, but it’s much better than what we started with.
Now, we’ve got our cleaned data:
>>> data = has_motto.dropna()[['numeric_affil', 'clean_motto']]
>>> data.columns = 'affiliation motto'.split()
Scikit-Learn
Let’s import what we’ll need: two vectorizers for transforming text into matrices, and the two naive Bayes classifiers we’ll compare. Bernoulli naive Bayes expects binary word-presence features, while Gaussian naive Bayes models each feature as normally distributed, which is the more natural pairing for continuous tf-idf weights. We’ll also grab the word_tokenize function from NLTK.
>>> from nltk import word_tokenize
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.naive_bayes import BernoulliNB
>>> from sklearn.naive_bayes import GaussianNB
>>> binary_vectorizer = CountVectorizer(
binary=True, min_df=1, tokenizer=word_tokenize)
>>> tfidf_vectorizer = TfidfVectorizer(
min_df=1, tokenizer=word_tokenize)
>>> binary_nb, gaussian_nb = BernoulliNB(), GaussianNB()
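To get a feel for what the binary vectorizer produces, here’s a toy corpus (made up for illustration) run through a fresh copy of it:
>>> demo_vectorizer = CountVectorizer(
        binary=True, min_df=1, tokenizer=word_tokenize)
>>> toy_matrix = demo_vectorizer.fit_transform(
        ['freedom is the right of all', 'all hail megatron'])
>>> sorted(demo_vectorizer.vocabulary_)  # one column per distinct token
['all', 'freedom', 'hail', 'is', 'megatron', 'of', 'right', 'the']
>>> toy_matrix.toarray()  # 1 = token present in that motto, 0 = absent
array([[1, 1, 0, 1, 0, 1, 1, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]])
The tfidf_vectorizer fills those cells with weights instead, down-weighting tokens that show up in many mottos.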
To compare the Bernoulli and the Gaussian naive Bayes classifiers, we’ll need a vectorized version of the mottos for each. To play fairly, we’ll break our data into train and test groups; using the same random_state for both splits guarantees they line up row-for-row, so affil_train and affil_test come out the same either way.
>>> from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
>>> binary_motto = binary_vectorizer.fit_transform(data.motto)
>>> tfidf_motto = tfidf_vectorizer.fit_transform(data.motto)
>>> (binary_motto_train,
binary_motto_test,
affil_train,
affil_test) = train_test_split(
binary_motto, data.affiliation, test_size=0.2, random_state=1729)
>>> (tfidf_motto_train,
tfidf_motto_test,
affil_train,
affil_test) = train_test_split(
tfidf_motto, data.affiliation, test_size=0.2, random_state=1729)
Now to build the two models we wish to compare. (GaussianNB can’t handle sparse matrices, hence the .toarray() call.)
>>> binary_nb.fit(binary_motto_train, affil_train)
>>> gaussian_nb.fit(tfidf_motto_train.toarray(), affil_train)
Comparing Models
If this were a real-life task, we wouldn’t want to peek at the test data while still choosing between models. To compare our models, then, we won’t resort to the test data just yet; instead, we’ll use leave-one-out cross-validation.
>>> from sklearn.model_selection import cross_val_score, LeaveOneOut
>>> binary_cv_score = cross_val_score(binary_nb,
                                      binary_motto_train,
                                      affil_train,
                                      cv=LeaveOneOut())
>>> gaussian_cv_score = cross_val_score(gaussian_nb,
                                        tfidf_motto_train.toarray(),
                                        affil_train,
                                        cv=LeaveOneOut())
>>> print("Binary CV Score: {0:.5f}".format(binary_cv_score.mean()))
>>> print("Gaussian CV Score: {0:.5f}".format(gaussian_cv_score.mean()))
Binary CV Score: 0.65341
Gaussian CV Score: 0.58523
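If leave-one-out is new to you: for n training samples it fits n models, each trained on all but one sample and scored on the single held-out one. Here’s a hand-rolled sketch of what cross_val_score is doing above (equivalent in spirit, not the library’s actual code):
>>> def loo_accuracy(model, X, y):
        y = np.asarray(y)
        hits = []
        for i in range(X.shape[0]):
            rest = np.delete(np.arange(X.shape[0]), i)  # every row but i
            model.fit(X[rest], y[rest])
            hits.append(model.predict(X[i])[0] == y[i])
        return np.mean(hits)
Running loo_accuracy(BernoulliNB(), binary_motto_train, affil_train) should agree with binary_cv_score.mean().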
Neither of these classifiers looks great, but the binary one wins on our training data. Let’s pretend we’ve made the decision to field that classifier in production and check out its performance on the test data.
Binary Classifier Results
>>> from sklearn import metrics
>>> binary_prediction = binary_nb.predict(binary_motto_test)
Here’s the simplest way of looking at our Binary NB’s performance: a confusion matrix. As we can see from this, it looks like our classifier was pretty good at getting Autobots right, but Decepticons were harder to accurately predict—I guess this makes sense, since Decepticons are rather sneaky.
>>> print(metrics.confusion_matrix(affil_test, binary_prediction))
                Predicted Decepticon    Predicted Autobot
Was Decepticon                     8                   28
Was Autobot                        6                   46
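We can read those rates straight off the matrix: recall for a class is its diagonal entry divided by its row total, and precision divides by the column total instead.
>>> 8 / (8 + 28)    # Decepticon recall: caught 8 of 36 actual Decepticons
0.2222222222222222
>>> 46 / (6 + 46)   # Autobot recall: caught 46 of 52 actual Autobots
0.8846153846153846
>>> 8 / (8 + 6)     # Decepticon precision: 8 of 14 Decepticon calls were right
0.5714285714285714
These should line up with the classification_report we print next.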
For a more in-depth measure of the classifier’s performance, we can turn to sklearn’s classification_report:
>>> print(metrics.classification_report(
affil_test,
binary_prediction,
target_names='Decepticon Autobot'.split()))
             precision    recall  f1-score   support

 Decepticon       0.57      0.22      0.32        36
    Autobot       0.62      0.88      0.73        52

avg / total       0.60      0.61      0.56        88
According to this, we can see that our classifier is best at identifying actual Autobots (see the Autobot recall score), but isn’t that great at much else. For more on how to interpret this table, see Scikit-Learn’s documentation.
Conclusion
It looks like we didn’t do that well at classifying Transformer affiliation from motto; our best classifier’s average F1 score was only slightly above 50%. One possibility is that we didn’t have enough data; perhaps a classifier trained on thousands of samples would do better than ours, which had fewer than 400. Another possibility is that mottos aren’t that indicative of affiliation to begin with, and we should look for predictive power elsewhere. In any case, it’s kind of fitting to use Machine Learning on Transformers.