Three Horror Writers Walk into a Predictive Model

- July 24, 2018

How I learned to stop worrying and love the vectorizor

In the world of horror writers, there are few that can top the likes of H.P. Lovecraft, Mary Shelley, and Edgar Allan Poe (maybe Stephen King, but that's mostly due to volume. Not to say he's bad. I've actually never read any of his stuff. This aside has gone on too long.) An avid fan could probably tell the difference between their writings, but how well could we train a computer to do the same. Computers are notorious for their lack if interest in literature, so getting them to absorb the text in a language they understand it tricky. Furthermore, getting those converted text to play nicely with the other features is a bit of a mess, so her is vecorizing and feature union, spooky style. But first, wordclouds!

Inside of our wordy Cthulhu, We see a number of words that don't seem to have to much significance

And the same with Poe's Raven. The vocabulary seems pretty similar with the exception being the word "upon"

Mary Shelly on the other hand has more variety with positive words like "love" and "life" and even names like "Raymond." Nothing specific can be gleamed from this exercise, but its pretty to look at.

If you want to create a model that you can interpret after, there are really two options. CountVectorizor and TF-IDS Vectorizor. CountVectorizor, the simpler of the 2 options gives a count of certain words and ranks them based on their frequency. TF-IDF (Term Frequency - Inverse Document Frequency) adds an added weight to the proceedings by downplaying words that appear often in a document. This helps common words (that are not excluded by the stop words) to not overpower the model. I used TF-IDF to vectorize the sentences.

I also made a number of features based on the text like sentiment factor, parts of speech, and the most syllables in a sentence. Now making a model on the vecorized text alone is easy and with the created features alone is easy as well, but putting them together will require some feature union.

First a class must be created that take in feature names and their vecorizing methods.

After that, separate pipelines are created for the text features and the numeric features. The text pipeline had the vectorizer put into it while the numerical is passed through. They will be fit and transformed in the same manner.