Decision Boundary in Python

Definition of Decision Boundary

In classification problems with two or more classes, a decision boundary is a hypersurface that partitions the underlying vector space into sets, one for each class. Andrew Ng gives a nice example of a decision boundary in logistic regression.
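
For intuition, here is a minimal sketch of that idea for logistic regression, using hypothetical toy data (two Gaussian blobs, not the dataset used below): the decision boundary is the set of points where the predicted probability is exactly 0.5, i.e. where w·x + b = 0.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two Gaussian blobs, one per class
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X_toy, y_toy)

# The boundary is where w . x + b = 0, i.e. predicted probability 0.5
w, b = model.coef_[0], model.intercept_[0]

# Solve w0*x0 + w1*x1 + b = 0 for x1 to express the boundary as a line
x0 = np.linspace(X_toy[:, 0].min(), X_toy[:, 0].max(), 100)
x1_boundary = -(w[0] * x0 + b) / w[1]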

Some models, like logistic regression, have linear decision boundaries, while others, like Random Forest, have non-linear ones. Let's create a dummy dataset with two explanatory variables and a two-class target and look at the decision boundaries of different algorithms.

Create the Dummy Dataset

We will create a dummy dataset with scikit-learn consisting of 200 rows, 2 informative independent variables, and 1 target with two classes.

from sklearn.datasets import make_classification

# 200 samples, 2 informative features, binary target
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=1)
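
A quick sanity check of the shapes and class balance (note that make_classification flips a small fraction of labels by default, so the counts may not be exactly 100/100):

import numpy as np

print(X.shape, y.shape)   # (200, 2) (200,)
print(np.bincount(y))     # samples per class, roughly balanced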
  

Create the Decision Boundary of each Classifier

We will compare the following six classification algorithms:

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • Support Vector Machines (SVM)
  • Naive Bayes
  • Neural Network

We will work with the mlxtend library to plot the decision regions. For simplicity, we keep the default parameters of every algorithm (apart from setting gamma='auto' for the SVC).

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB 
from sklearn.neural_network import MLPClassifier



# Initializing Classifiers
clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf3 = RandomForestClassifier()
clf4 = SVC(gamma='auto')
clf5 = GaussianNB()
clf6 = MLPClassifier()

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from mlxtend.plotting import plot_decision_regions

# Jupyter magic; omit this line when running outside a notebook
%matplotlib inline

# 3 x 2 grid of subplots, one per classifier
gs = gridspec.GridSpec(3, 2)

fig = plt.figure(figsize=(14,10))

labels = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM', 'Naive Bayes', 'Neural Network']
# Fit each classifier and draw its decision regions in its grid cell
for clf, lab, grd in zip([clf1, clf2, clf3, clf4, clf5, clf6],
                         labels,
                         [(0,0), (0,1), (1,0), (1,1), (2,0), (2,1)]):

    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(lab)

plt.show()
 
[Figure: decision boundaries of the six classifiers]

Discussion

Clearly, Logistic Regression has a linear decision boundary, whereas tree-based algorithms like Decision Tree and Random Forest create rectangular partitions. Naive Bayes leads to a linear decision boundary in many common cases, but it can also be quadratic, as in our case. SVMs can capture many different boundaries depending on the gamma and the kernel, and the same applies to Neural Networks.
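
To make the last point concrete, here is a minimal sketch that refits the SVM with a few different kernel and gamma settings (the specific values are arbitrary choices for illustration) and plots each boundary, reusing X, y, and plot_decision_regions from above:

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions

plt.figure(figsize=(14, 4))
for i, (kernel, gamma) in enumerate([('linear', 'auto'), ('rbf', 0.5), ('rbf', 5)]):
    svm = SVC(kernel=kernel, gamma=gamma).fit(X, y)
    plt.subplot(1, 3, i + 1)
    plot_decision_regions(X=X, y=y, clf=svm, legend=2)
    plt.title('kernel=%s, gamma=%s' % (kernel, gamma))
plt.show()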
