PCA with K-Nearest Neighbors on the Iris dataset

The Iris dataset is a famous, small dataset with 150 instances. Each instance has 4 features, and each label is one of 3 species of iris flower: setosa, versicolor, or virginica.
Even though the dataset is very simple, we cannot visualize the data in its original 4 dimensions while preserving all of the information.
We will use PCA to reduce the dimensionality to 2 new features, plot the data, and draw the classification plane produced by the selected model (check out my previous article about PCA if you haven't read it yet: Implementation of PCA with scikit-learn and Matplotlib).
As our model we choose a simple multi-class (i.e. the classification task has more than two classes) and non-parametric classifier: the KNN (k-nearest neighbors) classifier.
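To make the idea concrete, here is a minimal NumPy sketch of how a KNN prediction works for a single instance (knn_predict is an illustrative helper, not part of scikit-learn):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=10):
    # Euclidean distance from the new point to every training instance
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training instances
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

Being non-parametric, the model learns no weights during training: it simply memorizes the training set and votes among the k closest points at prediction time.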

In [1]:
from IPython.display import display, HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
from IPython.display import IFrame
IFrame('https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm', width=900, height=450)
Out[1]: (embedded Wikipedia page on the k-nearest neighbors algorithm)
In [2]:
# import the modules we need

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import matplotlib
from matplotlib.colors import ListedColormap
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-whitegrid')

from sklearn.datasets import load_iris

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
In [3]:
iris = load_iris()  # load the dataset
X = iris['data']
y = iris['target']
In [4]:
np.random.seed(11)
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.2, random_state = 22)

We fit the PCA on the training set only, so that no information from the test set leaks into the transformation.

In [5]:
pca = PCA(n_components = 2) # The first 2 principal components

pca.fit(X_train)         # it fits PCA on the X training set
X_train_trs = pca.transform(X_train) # it then transforms the X training set
X_test_trs = pca.transform(X_test)   # and also the X test set
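As an aside, the same steps can be chained with the Pipeline class imported above; this is just an equivalent sketch of what we do manually in this notebook, shown for reference:

# sketch: chain PCA and KNN so that fitting on the training set only is handled automatically
pca_knn = Pipeline([
    ('pca', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=10))
])
pca_knn.fit(X_train, y_train)        # PCA and KNN are fitted on the training set only
print(pca_knn.score(X_test, y_test)) # accuracy on the test set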
In [6]:
var1, var2 = pca.explained_variance_ratio_
print("1st PC preserves {:.1%} of the variance".format(var1))
print("2nd PC preserves {:.1%} of the variance.".format(var2))
1st PC preserves 92.5% of the variance
2nd PC preserves 5.3% of the variance.

So on this dataset the first two principal components preserve almost all of the variance (about 97.8% combined). We can expect to lose very little information when going from the original 4 features to the 2 new "artificial" features created with PCA.
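If you want to inspect the full variance spectrum yourself, a quick sketch is to re-fit PCA with all 4 components on the same training set and look at the cumulative ratios:

# sketch: fit PCA with all components to inspect how much variance each one preserves
pca_full = PCA().fit(X_train)
print(pca_full.explained_variance_ratio_)             # ratio preserved by each component
print(np.cumsum(pca_full.explained_variance_ratio_))  # cumulative ratio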

Let's plot the PCA scatterplot. In this scatterplot, every instance of the dataset (with 4 features) is mapped to a new instance with 2 new features (the 1st PC and the 2nd PC) and placed on the plane.
The labels are shown with 3 different symbols and colors.

In [7]:
csfont = {'fontname':'Comic Sans MS','fontsize':16}
markers = ('s', 'x', 'o')
colors = ('red', 'blue', 'lightgreen')
cmap = ListedColormap(colors[:len(np.unique(y_test))])

plt.figure(figsize=(10,8))
mapping = {0:'setosa', 1:'versicolor', 2:'virginica'}
for idx, cl in enumerate(np.unique(y_train)):
    plt.scatter(x = X_train_trs[y_train == cl, 0], y = X_train_trs[y_train == cl, 1],
               c = [cmap(idx)], marker = markers[idx], label = mapping[cl])
plt.xlabel('1st Principal Component', **csfont)
plt.ylabel('2nd Principal Component', **csfont)
plt.title('Training labels of PCA', **csfont)
plt.legend(loc = 4, prop = {'size' : 14});

Now we fit our KNN classifier with 10 neighbors (in this article our aim is not to find the best parameterization of the model, so we simply pick a reasonable, standard number of neighbors).
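If we did want to tune this hyperparameter, a cross-validated search with the GridSearchCV class imported above would be a natural way to do it; the candidate values below are just an illustrative grid, not a recommendation:

# sketch: cross-validated search over the number of neighbors (not used in the rest of the notebook)
param_grid = {'n_neighbors': [1, 3, 5, 10, 15, 20]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_trs, y_train)
print(grid.best_params_, grid.best_score_)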

In [8]:
K_model = KNeighborsClassifier(n_neighbors = 10)
K_model.fit(X_train_trs, y_train)
y_hat = K_model.predict(X_test_trs)
round(accuracy_score(y_test, y_hat),3)
Out[8]:
0.933

The model is quite accurate even though we have reduced the dimensionality. Moreover, working in two dimensions lets us plot the data together with the model's decision regions.
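As a sanity check, you could compare this score with the same classifier fitted on the original 4 features; a minimal sketch (its output is not shown here):

# sketch: same KNN on the original 4 features, for comparison with the 2-PC version
K_model_full = KNeighborsClassifier(n_neighbors=10)
K_model_full.fit(X_train, y_train)
print(round(accuracy_score(y_test, K_model_full.predict(X_test)), 3))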

Let's see how the classification was made by the algorithm. Below we can see the classification plane produced by the KNN classifier: the training instances are scattered on the left, and the test instances on the right.

In [9]:
def plot_decision_regions(X_train, X_test, y_train, y_test, classifier, resolution = 0.02):
    
    global mapping
    global csfont
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y_train))])

    # plot the decision surface
    x1_min, x1_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
    x2_min, x2_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    
    fig, (ax1, ax2) = plt.subplots(1,2,figsize = (17,8))
    fig.suptitle('Classification plane done with {}'.format(classifier.__class__.__name__), fontsize = 28 )
    for ax in (ax1,ax2):
        ax.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
        ax.set_xlim(xx1.min(), xx1.max())
        ax.set_ylim(xx2.min(), xx2.max())
        ax.set_xlabel('1st Principal Component', **csfont)
        ax.set_ylabel('2nd Principal Component', **csfont)
        ax.tick_params(axis='both', which='major', labelsize=12)

    ax1.set_title('Training instances on Fitted Classification Plane', **csfont)
    for idx, cl in enumerate(np.unique(y_train)):
        ax1.scatter(x = X_train[y_train == cl, 0], y = X_train[y_train == cl, 1],
                        alpha=0.8, c = [cmap(idx)],
                        marker=markers[idx], label= mapping[cl] + ' training set')
        ax1.legend(loc = 1, prop={'size': 12})
    
    ax2.set_title('Test instances on Fitted Classification Plane', **csfont)
    for idx, cl in enumerate(np.unique(y_test)):
        ax2.scatter(x = X_test[y_test == cl, 0], y = X_test[y_test == cl, 1], 
                        alpha = 0.8, c = [cmap(idx)],
                        marker=markers[idx],
                        linewidth=1, edgecolor='k',
                        s=100, label = mapping[cl] + " test set" )
        ax2.legend(loc = 1, prop={'size': 12})
In [10]:
plot_decision_regions(X_train_trs, X_test_trs, y_train, y_test, K_model)

On the right we can see that two virginica flowers from the test set (green circles) have been incorrectly classified as versicolor by the KNN classifier (they lie in the blue versicolor region).
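We can confirm this with a confusion matrix (confusion_matrix is not among the imports above, so we import it here); given the 0.933 accuracy on 30 test instances, we would expect exactly two virginica instances to fall in the versicolor column:

# confusion matrix of the test predictions (rows: true labels, columns: predicted labels)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_hat))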

