Solved – How to apply PCA to 3-dimensional image data in Python

classification · dimensionality reduction · image processing · pca · python

I have a dataset containing color images of cancerous and non-cancerous tissue cells. Each image has dimensions 50x50x3, and I have 280,000 images in total. I want to apply PCA to this dataset in order to reduce its dimensionality.

What steps would I take to apply PCA to this dataset? I currently have the image paths and the target variable (cancerous/non-cancerous) stored in a dataframe.

The way I thought of approaching it is to read each image with skimage's io.imread(), flatten it so that its shape changes from (50, 50, 3) to (7500,), and stack the flattened images into a NumPy array, so that the final array is 280,000 x 7500, where 280,000 is the total number of images I have.

After that, I would apply PCA, roughly as in the sketch below.
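In code, I picture something like this (a rough sketch; the file name and dataframe columns are placeholders, and with 280,000 images the full array may not fit in memory):

```python
import numpy as np
import pandas as pd
from skimage import io
from sklearn.decomposition import PCA

# Placeholder: a dataframe with columns "path" and "label"
df = pd.read_csv("images.csv")

# Read each 50x50x3 image and flatten it to a 7500-dimensional row
X = np.stack([io.imread(p).reshape(-1) for p in df["path"]])  # (n_images, 7500)
y = df["label"].values

# Reduce to e.g. 100 components; for very large X, sklearn's
# IncrementalPCA can be used instead to avoid holding everything in RAM
pca = PCA(n_components=100)
X_reduced = pca.fit_transform(X)  # (n_images, 100)
```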

My questions are:

  • Am I going about applying PCA in the correct way?
  • Does flattening the 3-dimensional color space and placing it in a single vector make sense?

If the above method is not optimal, what steps do I need to take to apply PCA without converting my images to greyscale?

My aim is to classify these images with a Support Vector Machine, after reducing their dimensionality.

Best Answer

In general, your approach may work, and it might even give you something that performs reasonably well. However, I would strongly advise against it, or would use something like this only as a first step to get a feel for the problem.

Think about it this way: if you just shift one of the images one pixel to the left, how much would the vector representing that image change? How well could a PCA identify that these two images are in fact the same image, except for this 1-pixel shift?

It is better to use an approach that is somewhat shift-invariant (and, if possible, rotation-invariant). Here are some ideas; code sketches for several of them follow the list.

  • You could use PCA to reduce the color space. Often the full 3D RGB space is not required. Instead of running the PCA on whole images, collect all pixels as individual 3D vectors and run the PCA on those. The resulting components tell you which colors are actually representative of your images. However, this gives you at best a reduction of the dataset to a third of its size; in that case you are essentially reducing to grayscale, while retaining as much information as possible (first sketch after this list).

  • Use a method similar to the one used by convolutional networks. Split each image into small (overlapping) patches of $K\times K$ pixels and run the PCA on those patches. The resulting components then represent typical local features found in your images and are much more informative than the components of a PCA run on the complete images. Experiment with the patch size and the amount of overlap to see what gives good results (second sketch after this list). If you know, for example, what a cancerous region looks like, you can inspect the resulting components to see whether any of them represent something you recognize. You can also drop patches that you recognize as meaningless (e.g. patches that contain mostly uniform areas).

  • You can test whether the patches work better if you run them on the color channels independently (separate patches for each channel, with their own component structure), or if you combine the channels first.

  • Mix, combine, and stack these methods. If you have found a good patch size and overlap but have not reduced your data enough, reduce the data using those patches. Because the patches represent areas of your images, you can still interpret the reduced data as 2D (or 3D, if you keep separate patches per color channel) data. Repeat the process and create patches of patches. At this point you are essentially building a form of convolutional neural network.

  • Although it might seem counterintuitive, in many cases it is helpful to first blow up your dataset (i.e. generate artificial data based on the data you have). The images you have may be very clean: all from the same angle, centered on the possibly cancerous region, and so on. This may or may not represent the actual situation in which you later want to use your classifier. If it doesn't, you will not train the SVM (or the PCA) well for the task at hand. Generate additional images by adding noise, shifting them, rotating them a little, etc., then run the PCA and the SVM on the enlarged dataset (third sketch after this list). This can greatly improve the final classifier.

  • If you want to go one step further, look at more powerful dimensionality-reduction techniques. A PCA always computes a linear reduction. Auto-encoder networks can be seen as a non-linear generalization of PCA; there are also convolutional auto-encoders, which give you the shift-invariance you usually need, and denoising auto-encoders, which in many cases perform much better than naive auto-encoders. You can feed the (encoded) output of an auto-encoder directly into an SVM for classification, or combine the auto-encoder with a classical neural network, which is essentially a way of building a deep neural network (last sketch below).
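Here are minimal sketches of some of these ideas, with placeholder data and arbitrary parameter choices. First, PCA on the color space, treating every pixel of every image as one 3-dimensional sample:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 50, 50, 3)  # placeholder stack of images

# Treat every pixel as an independent 3D color sample
pixels = X.reshape(-1, 3)  # (n_images * 2500, 3)

# One component acts as an "optimal grayscale"; keep two to retain more color
color_pca = PCA(n_components=1).fit(pixels)

# Project all pixels onto the learned color axis and restore the image shape
X_gray = color_pca.transform(pixels).reshape(X.shape[0], 50, 50, 1)
```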
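Second, PCA on small overlapping patches, roughly in the spirit of a convolutional layer; scikit-learn's extract_patches_2d does the patch extraction (patch size, patch count, and component count are all free parameters to experiment with):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.image import extract_patches_2d

X = np.random.rand(1000, 50, 50, 3)  # placeholder images

K = 8              # patch size, experiment with this
n_components = 16  # number of patch features to keep

# Sample random overlapping KxK patches from every image
patches = np.concatenate([
    extract_patches_2d(img, (K, K), max_patches=20, random_state=0)
    for img in X
])  # (n_images * 20, K, K, 3)

# Flatten each patch and learn the typical local features
patch_pca = PCA(n_components=n_components)
patch_pca.fit(patches.reshape(len(patches), -1))

# Each component can be inspected as a KxK color "feature"
features = patch_pca.components_.reshape(n_components, K, K, 3)
```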
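Third, simple data augmentation with small shifts, rotations, and noise, here sketched with scipy.ndimage (the perturbation amounts are arbitrary):

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(img, rng):
    """Return a randomly perturbed copy of a (50, 50, 3) image in [0, 1]."""
    out = shift(img, (rng.integers(-2, 3), rng.integers(-2, 3), 0), mode="nearest")
    out = rotate(out, angle=rng.uniform(-10, 10), axes=(0, 1),
                 reshape=False, mode="nearest")
    out = out + rng.normal(0.0, 0.01, size=out.shape)  # mild pixel noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
X = np.random.rand(1000, 50, 50, 3)     # placeholder images
y = np.random.randint(0, 2, size=1000)  # placeholder labels

# One augmented copy per image; repeat for a larger blow-up factor
X_aug = np.concatenate([X, np.stack([augment(img, rng) for img in X])])
y_aug = np.concatenate([y, y])
```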
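Finally, a minimal dense auto-encoder as a non-linear alternative to PCA, sketched here with Keras (assuming TensorFlow is installed; the layer sizes are illustrative). The 64-dimensional codes from the encoder can be fed directly into the SVM:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 50, 50, 3).reshape(1000, -1)  # flattened placeholders

# Encoder compresses 7500 -> 256 -> 64; decoder mirrors it back to 7500
encoder = keras.Sequential([
    layers.Input(shape=(7500,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(64, activation="relu"),
])
decoder = keras.Sequential([
    layers.Dense(256, activation="relu"),
    layers.Dense(7500, activation="sigmoid"),
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# Train to reconstruct the inputs, then use the codes as SVM features
autoencoder.fit(X, X, epochs=10, batch_size=256, verbose=0)
codes = encoder.predict(X)  # (n_images, 64)
```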