Solved – How to make sense of this PCA plot with logistic regression decision boundary (breast cancer data)

logistic, machine-learning, pca, scikit-learn

I have a breast cancer dataset with 13 features. I instantiate PCA, fit it on the training data, and transform the data down to 2 dimensions.

The transformed data looks linearly separable, which is interesting to me since I'm doing binary logistic regression.

I'm writing an article, and showing the data with a decision boundary would be a good image to demonstrate that the model worked.

I train the logistic regression model on the 2-d data from the PCA. I plot the decision boundary using the intercept and coefficients, and it does linearly separate the data.
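For reference, here is a minimal sketch of that setup. It uses scikit-learn's built-in breast cancer dataset as a stand-in (that version has 30 features rather than 13, and the variable names are illustrative, not the notebook's actual names):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize first -- PCA is sensitive to feature scales.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))
X_pca = pca.transform(scaler.transform(X_train))

# Binary logistic regression on the two PCA components.
clf = LogisticRegression().fit(X_pca, y_train)
print(clf.intercept_, clf.coef_)  # intercept and coefficients that define the boundary
```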

[Image: 2-d PCA plot with the logistic regression decision boundary]

My question is: what does the plot of the 2-d PCA data and the decision boundary mean? What can I deduce from it?

Jupyter notebook of my work (it's the very last plot)


Edit: Plot with the target class

[Image: 2-d PCA plot colored by target class]

Plot colored by the model's predictions on x_pca

[Image: 2-d PCA plot colored by predicted class]

Best Answer

What does the 2-d PCA data/plot mean?

The 2-d PCA plot shows two "compound features" (linear combinations of your original features) which PCA constructed to capture as much of the variance in your original 13 features as possible.

Assuming your 13 features are linearly independent (e.g. no feature is just another feature times 2, for every row in your data), it would take 13 dimensions to capture 100% of the variation in your 13 raw features. However, PCA can often capture e.g. 98% of the variation in your data in just a few dimensions. To see how much of the variance each PCA dimension explains for your problem, print the explained_variance_ratio_ attribute of your fitted PCA object (the estimator you called fit() on, not the transformed data).
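For example, continuing the sketch above (where pca names the fitted PCA estimator):

```python
# Variance captured by each of the two retained components, and their sum.
print(pca.explained_variance_ratio_)        # one ratio per component; values vary by dataset
print(pca.explained_variance_ratio_.sum())  # total fraction of variance captured in 2-d
```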

When PCA can capture a large amount of the variance of your features in just 2 dimensions, that's especially convenient because then you can plot those 2 PCA dimensions as you have, and know that any groupings which show up on the 2-d plot correspond to natural groupings in your 13-dimensional data.

What does the decision boundary mean?

The decision boundary in your code is a prediction of your target variable, using as features (independent variables) the first two PCA dimensions of your 13-dimensional original feature set.

Why is your decision boundary not in the obvious gap?

Remember that the PCA dimensions were formed based only on your 13 independent variables, without looking at your target. The decision boundary is not a boundary between the PCA clusters; it is a boundary that uses the PCA dimensions to predict the target.

So the fact that the decision boundary does not fall cleanly between the clusters means that the first two PCA dimensions of your 13 features do a good, but not perfect, job of separating your target classes.
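To make this concrete: with two PCA features, logistic regression's boundary is the line where w1*pc1 + w2*pc2 + b = 0. A sketch of drawing it, reusing clf and X_pca from the snippet above (assumed names, not the notebook's):

```python
import numpy as np
import matplotlib.pyplot as plt

# The boundary is where w1*pc1 + w2*pc2 + b = 0; solve for pc2 as a function of pc1.
b = clf.intercept_[0]
w1, w2 = clf.coef_[0]
pc1 = np.linspace(X_pca[:, 0].min(), X_pca[:, 0].max(), 100)
pc2 = -(b + w1 * pc1) / w2

plt.plot(pc1, pc2, "k--", label="decision boundary")
```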

How to improve your plot

What you really care about is the target, right? So in your plot, don't plot all the points in red; color them by target class. Then you will have a plot that shows how well PCA and the information in your original features (represented by the distance between clusters on your plot) distinguish between the target classes (the colors on the new plot).
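A sketch of that coloring, again reusing the names from the snippets above (in scikit-learn's breast cancer data, class 0 is malignant and class 1 is benign):

```python
import matplotlib.pyplot as plt

# Scatter the 2-d PCA points, colored by true class, with the boundary on top.
for label, name in [(0, "malignant"), (1, "benign")]:
    mask = y_train == label
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=name, alpha=0.6)
plt.plot(pc1, pc2, "k--", label="decision boundary")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend()
plt.show()
```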