Solved – How to apply PCA to 3-dimensional image data in Python

classification · dimensionality reduction · image processing · pca · python

I have a dataset containing color images of cancerous and non-cancerous tissue cells. Each image has dimensions 50x50x3, and I have 280,000 images in total. I want to apply PCA to this dataset in order to reduce its dimensionality.

What steps would I take to apply PCA to this dataset? I currently have the image paths and the target variable (cancerous/non-cancerous) stored in a dataframe.

The way I thought of approaching it is to read each image with skimage's io.imread(), flatten it so that its shape changes from (50, 50, 3) to (7500,), and stack the flattened images into a NumPy array, so that the final array is 280,000 x 7500, where 280,000 is the total number of images I have.

After that, I would apply PCA, roughly as in the sketch below.
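In code, I picture something like this (a rough sketch; the file name and dataframe columns are placeholders, and with 280,000 images the full array may not fit in memory):

```python
import numpy as np
import pandas as pd
from skimage import io
from sklearn.decomposition import PCA

# Placeholder: a dataframe with columns "path" and "label"
df = pd.read_csv("images.csv")

# Read each 50x50x3 image and flatten it to a 7500-dimensional row
X = np.stack([io.imread(p).reshape(-1) for p in df["path"]])  # (n_images, 7500)
y = df["label"].values

# Reduce to e.g. 100 components; for very large X, sklearn's
# IncrementalPCA can be used instead to avoid holding everything in RAM
pca = PCA(n_components=100)
X_reduced = pca.fit_transform(X)  # (n_images, 100)
```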

My questions are:

  • Am I going about applying PCA in the correct way?
  • Does flattening the 3-dimensional color space and placing it in a single vector make sense?

If the above method is not optimal, what steps do I need to take to apply PCA without converting my images to greyscale?

My aim is to classify these images with a Support Vector Machine, after reducing their dimensionality.

Best Answer

In general, your approach may work, and it might even give you something that performs reasonably well. However, I would strongly advise against it, or would use something like this only as a first step to get a feel for the problem.

Think about it this way: if you just shift one of the images one pixel to the left, how much would the vector representing that image change? How well could a PCA identify that these two images are in fact the same image, except for this 1-pixel shift?

It is better to use an approach that is somewhat shift-invariant (and, if possible, rotation-invariant). Here are some ideas; code sketches for several of them follow the list.

  • You could use PCA to reduce the color space. Often the full 3D RGB space is not required. Instead of running the PCA on whole images, collect all pixels as individual 3D vectors and run the PCA on those. The resulting components tell you which colors are actually representative of your images. However, this gives you at best a reduction of the dataset to a third of its size; in that case you are essentially reducing to grayscale, while retaining as much information as possible (first sketch after this list).

  • Use a method similar to the one used by convolutional networks. Split each image into small (overlapping) patches of $K\times K$ pixels and run the PCA on those patches. The resulting components then represent typical local features found in your images and are much more informative than the components of a PCA run on the complete images. Experiment with the patch size and the amount of overlap to see what gives good results (second sketch after this list). If you know, for example, what a cancerous region looks like, you can inspect the resulting components to see whether any of them represent something you recognize. You can also drop patches that you recognize as meaningless (e.g. patches that contain mostly uniform areas).

  • You can test whether the patches work better if you run them on the color channels independently (separate patches for each channel, with their own component structure), or if you combine the channels first.

  • Mix, combine, and stack these methods. If you have found a good patch size and overlap but have not reduced your data enough, reduce the data using those patches. Because the patches represent areas of your images, you can still interpret the reduced data as 2D (or 3D, if you keep separate patches per color channel) data. Repeat the process and create patches of patches. At this point you are essentially building a form of convolutional neural network.

  • Although it might seem counterintuitive, in many cases it is helpful to first blow up your dataset (i.e. generate artificial data based on the data you have). The images you have may be very clean: all from the same angle, centered on the possibly cancerous region, and so on. This may or may not represent the actual situation in which you later want to use your classifier. If it doesn't, you will not train the SVM (or the PCA) well for the task at hand. Generate additional images by adding noise, shifting them, rotating them a little, etc., then run the PCA and the SVM on the enlarged dataset (third sketch after this list). This can greatly improve the final classifier.

  • If you want to go one step further, look at more powerful dimensionality-reduction techniques. A PCA always computes a linear reduction. Auto-encoder networks can be seen as a non-linear generalization of PCA; there are also convolutional auto-encoders, which give you the shift-invariance you usually need, and denoising auto-encoders, which in many cases perform much better than naive auto-encoders. You can feed the (encoded) output of an auto-encoder directly into an SVM for classification, or combine the auto-encoder with a classical neural network, which is essentially a way of building a deep neural network (last sketch below).
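Here are minimal sketches of some of these ideas, with placeholder data and arbitrary parameter choices. First, PCA on the color space, treating every pixel of every image as one 3-dimensional sample:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 50, 50, 3)  # placeholder stack of images

# Treat every pixel as an independent 3D color sample
pixels = X.reshape(-1, 3)  # (n_images * 2500, 3)

# One component acts as an "optimal grayscale"; keep two to retain more color
color_pca = PCA(n_components=1).fit(pixels)

# Project all pixels onto the learned color axis and restore the image shape
X_gray = color_pca.transform(pixels).reshape(X.shape[0], 50, 50, 1)
```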
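Second, PCA on small overlapping patches, roughly in the spirit of a convolutional layer; scikit-learn's extract_patches_2d does the patch extraction (patch size, patch count, and component count are all free parameters to experiment with):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.image import extract_patches_2d

X = np.random.rand(1000, 50, 50, 3)  # placeholder images

K = 8              # patch size, experiment with this
n_components = 16  # number of patch features to keep

# Sample random overlapping KxK patches from every image
patches = np.concatenate([
    extract_patches_2d(img, (K, K), max_patches=20, random_state=0)
    for img in X
])  # (n_images * 20, K, K, 3)

# Flatten each patch and learn the typical local features
patch_pca = PCA(n_components=n_components)
patch_pca.fit(patches.reshape(len(patches), -1))

# Each component can be inspected as a KxK color "feature"
features = patch_pca.components_.reshape(n_components, K, K, 3)
```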
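Third, simple data augmentation with small shifts, rotations, and noise, here sketched with scipy.ndimage (the perturbation amounts are arbitrary):

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(img, rng):
    """Return a randomly perturbed copy of a (50, 50, 3) image in [0, 1]."""
    out = shift(img, (rng.integers(-2, 3), rng.integers(-2, 3), 0), mode="nearest")
    out = rotate(out, angle=rng.uniform(-10, 10), axes=(0, 1),
                 reshape=False, mode="nearest")
    out = out + rng.normal(0.0, 0.01, size=out.shape)  # mild pixel noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
X = np.random.rand(1000, 50, 50, 3)     # placeholder images
y = np.random.randint(0, 2, size=1000)  # placeholder labels

# One augmented copy per image; repeat for a larger blow-up factor
X_aug = np.concatenate([X, np.stack([augment(img, rng) for img in X])])
y_aug = np.concatenate([y, y])
```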
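Finally, a minimal dense auto-encoder as a non-linear alternative to PCA, sketched here with Keras (assuming TensorFlow is installed; the layer sizes are illustrative). The 64-dimensional codes from the encoder can be fed directly into the SVM:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 50, 50, 3).reshape(1000, -1)  # flattened placeholders

# Encoder compresses 7500 -> 256 -> 64; decoder mirrors it back to 7500
encoder = keras.Sequential([
    layers.Input(shape=(7500,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(64, activation="relu"),
])
decoder = keras.Sequential([
    layers.Dense(256, activation="relu"),
    layers.Dense(7500, activation="sigmoid"),
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# Train to reconstruct the inputs, then use the codes as SVM features
autoencoder.fit(X, X, epochs=10, batch_size=256, verbose=0)
codes = encoder.predict(X)  # (n_images, 64)
```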