I have a breast cancer data with 13 features. I instantiate PCA and put the training data through and transform down into 2 dimensions.
I get a data that is linearly separable which is interesting to me since I'm doing a binary logistic regression.
I'm writing an article and showing the data with a decision boundary is a good image to show that the model worked.
I train the logistic regression model on the 2-d data from the PCA. I plot the decision boundary using the intercept and coefficient and it does linearly separate the data.
My question is, What does plot of the 2-d data from the PCA and decision boundary mean? What can deduce from that?
Jupyter notebook to my work (It's the very last plot)
Edit: Plot with the target class
Plot with predict
on x_pca
Best Answer
What does the 2-d PCA data/plot mean?
The 2-d PCA data/plot represent two "compound features" which PCA created to capture as much of the variance in your original 13 features as possible.
Assuming your 13 features are linearly independent (e.g. one feature is not just another feature times 2, for every row in your data), it would take 13 dimensions to capture 100% of the variation in your 13 raw features. However, often PCA can capture e.g. 98% of the variation in your data in just a few PCA dimensions. To see how much of the variance is explained by each PCA dimension for your problem, print
x_pca.explained_variance_ratio_
after youfit()
yourx_pca
object.When PCA can capture a large amount of the variance of your features in just 2 dimensions, that's especially convenient because then you can plot those 2 PCA dimensions as you have, and know that any groupings which show up on the 2-d plot correspond to natural groupings in your 13-dimensional data.
What does the decision boundary mean?
The decision boundary in your code is a prediction of your
target
variable, using as features (independent variables) the first two PCA dimensions of your 13-dimension original feature set.Why is your decision boundary not in the obvious gap?
Remember the PCA dimensions were formed just based on your 13 independent variables, without looking at your
target
. The decision boundary is not a decision boundary between PCA clusters, it's a decision boundary using the PCA dimensions to predicttarget
.So the fact that the decision boundary is not totally between the clusters means PCA's first two dimensions of your 13 features do a good job of separating your
target
classes, but not a perfect job.How to improve your plot
What you really care about is
target
, right? So in your plot, don't plot all the points as red. Color them bytarget
class. Then you will have a plot that shows you how well PCA and the information in your original features (represented by distance/space between clusters on your plot) distinguishes betweentarget
classes (which would be colors on the new plot).