Bring Balance to the Data
When the Yays far outweigh the Nays, and vice versa
One would probably assume that 95% accuracy means a damn fine model, but in many cases that is about as bad as it gets. In a classification problem, the more lopsided the class counts are, the more trouble imbalanced classes will cause you.
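To see why a high accuracy can be meaningless, here is a minimal sketch (with made-up labels mirroring the 95/5 split discussed below): a "model" that never predicts the positive class still scores 95% accuracy.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 95% negative (0), 5% positive (1)
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95
```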
To demonstrate this concept, let's look at the issue of West Nile Virus in Chicago. Due to the heavy population density and the species of mosquito that live in the area, Chicago summers are a hotbed of activity. Using weather data and mosquito trap data, it is possible to build a predictive model to see when and where the virus will be most prevalent. While West Nile is very prevalent in Chicago, when looking at the traps, we see that the number of positive observations is, in fact, quite low (about 5%).
With that information, spraying can be done to prevent the spread of the virus. However, if you asked a computer to make the most accurate model it can to deal with the issue, it might return a confusion matrix like this.
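As a hedged illustration (using toy data with the same rough class balance, not the project's actual numbers), here is the kind of degenerate confusion matrix an accuracy-chasing model produces: every trap is predicted negative.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

# Toy stand-in for the trap data: 95 negative traps, 5 positive ones
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5

# A classifier that always predicts the most frequent class
model = DummyClassifier(strategy="most_frequent").fit(X, y)
print(confusion_matrix(y, model.predict(X)))
# [[95  0]
#  [ 5  0]]
```

Every one of the 5 positive traps lands in the false-negative cell, yet accuracy is still 95%.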
This doesn't help anyone. The model is essentially saying never, ever spray for mosquitoes. This isn't due to laziness on the part of the computer, or the fact that it is not susceptible to West Nile Virus; it simply has so much information on negative observations and so little on positive ones. So dealing with this imbalance is very important for answering the question of when to spray.
So one method of dealing with this issue is resampling, or bootstrapping: drawing random samples from the data until the classes are balanced. In our example we used the BalancedBaggingClassifier from the imblearn.ensemble module in Python. This method uses RandomUnderSampler internally to build a balanced data set for each estimator in the ensemble.
Another method is to change your frame of mind and move away from accuracy, instead optimizing for either recall or precision. As opposed to accuracy, where you are trying to be right as often as you can, maximizing recall or precision has more to do with minimizing false negatives or false positives, respectively. In our case, since we would rather spray and be wrong than not spray and be wrong, we will optimize our model on recall. With these two methods, we instead get the following result.
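In scikit-learn, optimizing on recall is as simple as changing the scoring metric during model selection. A sketch, again on synthetic data (the model and parameter grid are illustrative, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=42)

# scoring="recall" makes the search favor models that miss as few
# positive (virus-present) traps as possible
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"class_weight": [None, "balanced"]},
    scoring="recall",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```

Swapping `scoring` to `"precision"` would instead penalize unnecessary spraying (false positives).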
For the code and a full write up on this project, click here.