Principal Component Analysis (PCA) with Scikit-learn and Matplotlib

Principal Component Analysis is a dimensionality reduction method that transforms the $p$ predictors into $M$ new predictors, with $M < p$.
The new predictors are linear combinations of the old ones.
Suppose we have a linear model like:

$ y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} + e_i$

where $y$ is an $N \times 1$ vector.

In order to fit the model faster, or to visualize the data in 2D or 3D (which is not possible when we have more than 3 predictors), we can reduce the number of predictors with this method and obtain $M$ new predictors, where the $m$-th predictor vector ($N \times 1$) has the form:

$ Z_m = \sum_{j=1}^{p} \theta_{m,j} X_j \, , \quad m = 1, \dots, M $

After we have found the new predictors $ Z_1, Z_2, \dots, Z_M $, we regress $y$ on those $M$ new predictors.

The aim is to speed up the training process and to make the data possible to visualize, at the cost of a larger bias of the model. But we will see that if we choose the $\theta$ coefficients wisely, the extra bias can be negligible.

The way we choose the $\theta$ coefficients is what defines Principal Component Analysis.

The $\theta$ coefficients are chosen so that the new predictors capture most of the information contained in the original predictors. We must reduce the dimension while minimizing the loss of information, so the new predictors must preserve as much as possible of the variance of the old ones.
The 1st Principal Component can be thought of as the "artificial" predictor (direction) that preserves the largest fraction of the variance of the original predictor space.
The 2nd Principal Component can be thought of as the 2nd "artificial" predictor (direction) that preserves the second largest fraction of the variance of the original predictor space, with the constraint that it must be orthogonal to the first.
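To make the definition of the $\theta$ coefficients concrete, here is a minimal NumPy sketch (a toy predictor matrix of my own, not the data used later) that computes the loadings as eigenvectors of the covariance matrix of the centered predictors: the first column is the direction of maximum variance, the second is the orthogonal direction with the next largest variance, and so on.

import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))          # toy N x p predictor matrix, just for illustration

X_c = X_toy - X_toy.mean(axis=0)            # center each predictor
cov = np.cov(X_c, rowvar=False)             # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]           # sort directions by decreasing variance
theta = eigvecs[:, order]                   # column m holds the loadings theta_{m,1} ... theta_{m,p}
Z = X_c @ theta                             # principal component scores Z_1 ... Z_p
frac_var = eigvals[order] / eigvals.sum()   # fraction of variance preserved by each component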


We now analyze PCA through a simulation with Scikit-learn and then plot the results in order to better understand how the predictors and the response are associated.

In [1]:
#import modules and style the notebook

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.decomposition import PCA
import matplotlib
from mpl_toolkits import mplot3d
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-whitegrid')
In [3]:
np.random.seed(18) # in order to repeat the simulation with the same values 

''' Let's say we have 4 original predictors '''

x1 = np.random.randn(1000) + 2 # 1000 instances of a standard normal + 2
x2 = np.random.randn(1000) + 2
x3 = np.random.normal(0, 6, size = 1000) # normal with mean = 0, standard deviation = 6
x4 = np.random.normal(0, 6, size = 1000)

X = np.c_[x1, x2, x3, x4] # let's put those in the X matrix
In [4]:
y = x1 + x2 + x3 + x4 + np.random.randn(1000) # the original data-generating process

We start with a LinearRegression with the original predictors:

In [5]:
# we split the data into training and test sets; for this study k-fold cross-validation is not necessary
    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 22)                                                                                           
In [6]:
lr = LinearRegression()
In [7]:
model = lr.fit(X_train, y_train)
hat = model.predict(X_test)
res = y_test - hat
R2_train = round(model.score(X_train, y_train),3)
R2_test = round(model.score(X_test, y_test),3)
MSE_original = MSE(y_test, hat)
print("R^2 on training set:", R2_train,"\n","R^2 on test set:", R2_test)
R^2 on training set: 0.986 
 R^2 on test set: 0.987

As we might expect, the $R^2$ is almost 1 because the data-generating process of our simulation is exactly a linear combination of those 4 random variables, and the noise is a Gaussian error whose values are too small to play a significant role in the value of $y$.

But even though our model is very simple and accurate, we can't plot our results, so we don't have an intuitive grasp of the association between predictors and response.

We can obtain it through a PCA.

In [8]:
pca = PCA(n_components = 0.90) # with a float between 0 and 1, PCA keeps the minimum number of components required
                               # to preserve 90% of the training set's variance. It also centers the data automatically.
pca.fit(X_train)         # it fits PCA on the X training set
X_train_trs = pca.transform(X_train) # it then transforms the X training set
X_test_trs = pca.transform(X_test)   # and also the X test set
In [9]:
print("PCA module has found {} components.". format(pca.n_components_))
PCA module has found 2 components.
In [10]:
var1, var2 = pca.explained_variance_ratio_
print("1st PC preserves a fraction {:.3f} of the variance.".format(var1))
print("2nd PC preserves a fraction {:.3f} of the variance.".format(var2))
1st PC preserves a fraction 0.510 of the variance.
2nd PC preserves a fraction 0.461 of the variance.
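To see how PCA(n_components = 0.90) arrives at 2 components, a quick check (just a sketch, reusing X_train and the imports above) is to fit a full PCA and look at the cumulative explained-variance ratio: 2 is the smallest number of components whose cumulative ratio exceeds 0.90.

pca_full = PCA().fit(X_train)                              # keep all 4 components
cum_ratio = np.cumsum(pca_full.explained_variance_ratio_)  # cumulative fraction of preserved variance
n_needed = int(np.argmax(cum_ratio > 0.90)) + 1            # first position where the 90% threshold is exceeded
print(cum_ratio, n_needed)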

We now fit a Principal Components Regression instead.

In [11]:
model = lr.fit(X_train_trs, y_train)  # Fit the new Linear regression, with the new components as predictors.
hat = model.predict(X_test_trs)       # y_hat
res = y_test - hat                    # residuals
MSE_pca = MSE(y_test, hat)
R2_train_pca = model.score(X_train_trs, y_train)
R2_test_pca = model.score(X_test_trs, y_test)
In [12]:
print("R^2_pca on training set:", round(R2_train_pca,3), "\n","R^2_pca on test set:", round(R2_test_pca,3))
R^2_pca on training set: 0.957 
 R^2_pca on test set: 0.963

$R^2$ has remained very high even though we fit the model with only 2 predictors.
This is possible because those 2 new "artificial" predictors capture most of the variance of the original 4. PCA is exactly the method that performs this small "miracle".

So now we have 2 predictors and the response.
We can plot them in 2 ways.
The first is a scatter plot where the response is encoded with both color and size (I prefer to use 2 parameters rather than just one because it gives a better graphic effect). Let's see below.

In [13]:
matplotlib.rcParams.update({'font.size': 18, 'font.family': 'sans serif'})

fig = plt.figure(figsize = (14,8)); 
ax = fig.add_subplot(1,1,1)
cmap = matplotlib.cm.get_cmap('Blues')

normalize = matplotlib.colors.Normalize(vmin=min(y_test), vmax=max(y_test))
sizes = 30 * (y_test - y_test.min()) + 10  # shift the sizes so they are always positive (y_test contains negative values)
ax.scatter(X_test_trs[:,0], X_test_trs[:,1], s = sizes, c = y_test, cmap = cmap, norm = normalize)
ax.set(xlabel = '1st Principal Component (on test set)', ylabel= '2nd Principal Component (on test set)', \
       title =  'PCA with 2 components \n Test target values indicated with color and size')

cax, _ = matplotlib.colorbar.make_axes(ax)
cbar = matplotlib.colorbar.ColorbarBase(cax, cmap=cmap , norm=normalize);

The second way is a 3D graph where we can plot the real response (scatter) and the regression plane, where the predicted responses lie.

In [14]:
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
In [15]:
# build a regular grid over the range of the 2 components to draw the regression plane
x_grid = np.linspace(X_test_trs[:,0].min(), X_test_trs[:,0].max(), 30)
y_grid = np.linspace(X_test_trs[:,1].min(), X_test_trs[:,1].max(), 30)
xx, yy = np.meshgrid(x_grid, y_grid)
Z = model.intercept_ + xx * model.coef_[0] + yy * model.coef_[1]
fig1 = plt.figure(figsize = (12,10))
ax = fig1.add_subplot(projection='3d')      # gca(projection='3d') is deprecated in recent Matplotlib versions
ax.plot_surface(xx, yy, Z, color = 'grey', alpha = 0.1)
ax.scatter(X_test_trs[:,0], X_test_trs[:,1], y_test, s = 80, marker = '8', edgecolors= 'k', alpha = 0.5) #, c = y_test, cmap = 'Blues')
ax.set_xlabel('1st Principal Component', labelpad = 20)
ax.set_ylabel('2nd Principal Component',  labelpad = 20)
ax.set_zlabel('Responses and regression plane', labelpad = 10)
#ax.view_init(10, 95)
ax.view_init(30, 60)

So, the point is: if we have a new instance characterized by the 4 original features, we can immediately see the association between that instance and the response. PCA will transform that instance into a new one with only two features (its best two-feature version), and we can then plot it in the graphs above together with its associated response.
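For example, here is a minimal sketch of how a fresh observation (the 4 feature values below are made up for illustration) would be projected onto the 2 principal components and scored with the fitted PCR model:

x_new = np.array([[2.1, 1.8, -3.5, 4.2]])   # hypothetical new instance with the 4 original features
z_new = pca.transform(x_new)                # project it onto the 2 principal components
y_new_hat = model.predict(z_new)            # predicted response from the PCR fit
print(z_new, y_new_hat)                     # z_new can be added as a point on the plots above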

Finally, let's compare the mean squared error of the original Linear regression with the same metric of the PCA Linear Regression:

In [16]:
print('Test MSE original model:', round(MSE_original,3))
print('Test MSE PCA model:', round(MSE_pca,3))
Test MSE original model: 0.94
Test MSE PCA model: 2.603

As we said earlier, the ability to plot the data and see the associations between them increases the bias of the model, because we reduce the original dimensionality. However, reducing the dimensionality of the model generally also reduces its variance, so we can't say in advance whether PCA increases the MSE or not compared to the original model (always remember the bias-variance trade-off). In our specific simulation the bias component mattered more than the variance component, so the MSE increased.
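A rough way to see this trade-off in our simulation is the sketch below (my own addition, not part of the analysis above), which refits the regression on 1, 2, 3 and 4 principal components and prints the test MSE of each; with all 4 components kept, PCR uses a full-rank rotation of the original predictors, so its predictions coincide with those of the original linear regression.

for m in range(1, 5):
    pca_m = PCA(n_components = m).fit(X_train)                        # keep m components
    lr_m = LinearRegression().fit(pca_m.transform(X_train), y_train)  # regress y on those m components
    mse_m = MSE(y_test, lr_m.predict(pca_m.transform(X_test)))
    print("{} components -> test MSE: {:.3f}".format(m, mse_m))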

