Solved – Skewness and Kurtosis in an Image

image processing

I am working with textures in an image. I have implemented algorithms to calculate the skewness and kurtosis of an image histogram. Great, delighted with that. I am using RGB as my color model. I know the equations and I understand what they and the algorithms are doing in code. But I have no idea what they actually represent visually in the image.

If I am wrong, please correct me, but I would expect the mean to be some shade of blue, and the standard deviation to be deviations from the mean, so the greens and reds in the image would start to come into the mix.

But what would skewness and kurtosis represent? My guess for skewness would be the more intense colours like the yellows and the reds as they would be to the far right of the mean. But as for kurtosis, I am stumped.

Alternatively, my understanding of the whole thing could be incorrect and I may be making no sense here whatsoever. If so, point me in the right direction. I am aware this is not a 'programming' question but, needless to say, help is required. I am asking here because

a) I would imagine there to be many people who have experience working with images in various programming languages.

b) I have already tried the signal processing community. It was good: as a result of posting there, I understand that skewness is about the symmetry of the histogram, and that kurtosis is a high or low peak around the mean. But in visual terms, I still don't quite understand.

Best Answer

To calculate descriptive statistics (such as mean, variance, skewness, kurtosis, etc.) for an image, you first need to get the histogram of the image. If your image is not gray-scale, you need to work on each of the 3 channels (R, G, B) separately. Say you have the histogram of one channel of your image, you have calculated the skewness and kurtosis, and now you want to analyze the results. Here is what they mean:
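As a rough illustration of the per-channel procedure described above, here is a minimal Python sketch using NumPy. The function name `histogram_moments`, the synthetic random image, and the choice of 256 intensity levels (8-bit channels) are assumptions made for this example, not part of the original question. Kurtosis is reported in its non-excess form, which is 3 for a normal distribution.

```python
import numpy as np

def histogram_moments(channel, levels=256):
    """Mean, std, skewness and kurtosis of one 8-bit channel,
    computed from its intensity histogram.
    Assumes the channel is not perfectly flat (std > 0)."""
    hist = np.bincount(channel.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()              # normalised histogram (probabilities)
    x = np.arange(levels)              # intensity values 0..levels-1
    mean = np.sum(x * p)
    std = np.sqrt(np.sum((x - mean) ** 2 * p))
    z = (x - mean) / std               # standardised intensities
    skew = np.sum(z ** 3 * p)          # third standardised moment
    kurt = np.sum(z ** 4 * p)          # fourth standardised moment (non-excess)
    return mean, std, skew, kurt

# Synthetic RGB image standing in for real data (hypothetical input)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# One set of statistics per channel, as the answer suggests
stats = {name: histogram_moments(image[..., i]) for i, name in enumerate("RGB")}
```

Working from the histogram gives the same values as computing the moments directly over the pixels; the histogram form is just cheaper when the image is large.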

Skewness: a measure of (lack of) symmetry [1]. For instance, if the skewness is negative, the histogram is negatively skewed: its left tail is longer or fatter than its right one (wiki:skewness). In image terms, the frequency over the darker intensities (closer to zero) is more widely spread (less concentrated, but not necessarily less frequent than the right tail!). Positive skewness is the opposite.
Kurtosis: best described in [2]:

is the average (or expected value) of the standardized data raised to the fourth power. Any standardized values that are less than 1 (i.e., data within one standard deviation of the mean, where the "peak" would be), contribute virtually nothing to kurtosis, since raising a number that is less than 1 to the fourth power makes it close to zero. The only data values (observed or observable) that contribute to kurtosis in any meaningful way are those outside the region of the peak; i.e., the outliers. Therefore kurtosis measures outliers only; it measures nothing about the "peak."
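To make the two descriptions above concrete, here is a small Python sketch on made-up pixel values. The helper `standardized_moment` and the two toy channels are assumptions for illustration only. It checks the sign of the skewness for a mostly dark and a mostly bright channel, and measures how much of the kurtosis comes from pixels within one standard deviation of the mean.

```python
import numpy as np

def standardized_moment(values, order):
    """Average of the standardised values raised to the given power."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.mean(z ** order)

# Mostly dark channel with a few bright pixels: long right tail
dark = np.concatenate([np.full(950, 20), np.full(50, 240)])
# Mirror image: mostly bright with a few dark pixels: long left tail
bright = 255 - dark

skew_dark = standardized_moment(dark, 3)      # positive
skew_bright = standardized_moment(bright, 3)  # negative, same magnitude

# Share of the kurtosis sum contributed by pixels with |z| < 1
z = (dark - dark.mean()) / dark.std()
inside = np.abs(z) < 1
share_inside = np.sum(z[inside] ** 4) / np.sum(z ** 4)
```

In this toy case 95% of the pixels lie within one standard deviation of the mean, yet they contribute well under 1% of the kurtosis; the few bright outliers account for almost all of it, which is the point the quotation makes.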

So, although you might be able to interpret skewness visually to some extent, neither statistic is meant to be visually interpretable. They are called descriptive statistics precisely because they describe the distribution function (the histogram in this case) and not directly the data whose distribution is being studied.

It is worth mentioning that, despite their unintuitive numeric values, such parameters are extremely helpful when working with large-scale (big data) image repositories: for querying images based on their content (i.e. content-based image retrieval), data mining on large images, applying machine learning algorithms to real datasets, and the like.

Related Solutions

Solved – Using PCA for detecting similar regions in an image

It's to be expected that "copied" blocks are almost equal (and more so after the PCA manipulation), so in the lexicographical sort "copied" blocks should appear adjacent or near each other. (Warning: this assumes the lexicographic order sorts on the most principal component first, and so on.) The reverse is not true: adjacent elements in the lexicographically sorted list are not necessarily copied, nor even similar.

Here I made up a very simple example myself, in Octave, with a one-dimensional signal (y) of size N=200, which has a portion of it copied (here, from 20-50 to 150-180) and a little noise added. I take a small block size (b=3). I convert to principal components, sort the rows in lexicographical order (first appending the original block position in an extra column), and compute the distance between adjacent rows. Notice that I'm simplifying a lot here: I'm not discarding components, nor quantizing them, and I'm considering only adjacent rows, not a neighborhood band. I then look at the histogram of those distances, and the original offset is clearly visible.

pkg load statistics  % princomp is provided by the statistics package
N=200;
b=3;
delay=130;
y = filter([1],[1,-0.8,0.1],rand(1,N)-0.5); % my signal, rather arbitrary
y(20+delay:50+delay) = y(20:50);  % a portion is copied
y += (rand(1,N)-0.5)*0.1; % noise added
yy=[y(1:N-2);y(2:N-1);y(3:N)];  % octave does not have corrmtx (this is not general in b!)
[PC, Z, W, TSQ] = princomp (yy'); % PCA
Z(:,b+1)=[1:N-2]'; % append original block position, in extra column
Z1=sortrows(Z);  % sort rows lexicographically
Z2=abs(Z1(1:N-3,b+1)-Z1(2:N-2,b+1));  % compute temporal distances between adjacent rows
hist(Z2); % histogram: should show a peak at delay

Deep Learning – Why Normalize Images by Subtracting Dataset’s Image Mean Instead of Current Image Mean?

Subtracting the dataset mean serves to "center" the data. Additionally, you would ideally divide by the standard deviation of that feature or pixel as well, if you want to normalize each feature value to a z-score.
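A minimal NumPy sketch of this centering and scaling, assuming a hypothetical dataset of 100 small RGB images stored as one array of shape (images, height, width, channels). The array name `dataset`, its size, and the small constant `eps` are illustrative choices, not taken from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dataset: 100 RGB images of 8x8 pixels, values in [0, 255]
dataset = rng.integers(0, 256, size=(100, 8, 8, 3)).astype(np.float32)

# Statistics per pixel position and channel, taken over the dataset axis
mean = dataset.mean(axis=0)   # the "mean image"
std = dataset.std(axis=0)

eps = 1e-7                    # guards against division by zero
normalized = (dataset - mean) / (std + eps)   # z-score of each feature
```

After this step every pixel position has roughly zero mean and unit standard deviation across the dataset. The same `mean` and `std` computed on the training set would normally be reused for validation and test images.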

The reason we do both of those things is that, in the process of training our network, we're going to be multiplying these initial inputs by weights and adding biases to them in order to produce activations, and then backpropagating the gradients to train the model.

In this process we'd like each feature to have a similar range, so that our gradients don't go out of control (and so that we only need one global learning rate multiplier).

Another way to think about it: deep learning networks traditionally share many parameters. If you didn't scale your inputs in a way that resulted in similarly ranged feature values (i.e. over the whole dataset, by subtracting the mean), sharing wouldn't happen very easily, because the same weight w would be too large for one part of the image and too small for another.

You will see in some CNN models that per-image whitening is used, which is more along the lines of your thinking.
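For contrast with dataset-level normalization, here is one possible reading of per-image whitening as per-image standardization, sketched in NumPy. The function name `per_image_standardize` and the floor on the standard deviation are assumptions for this example; specific CNN implementations may define the operation differently.

```python
import numpy as np

def per_image_standardize(image):
    """Shift one image to zero mean and scale it to unit variance,
    using only that image's own statistics."""
    img = np.asarray(image, dtype=np.float32)
    mean = img.mean()
    # Floor on the std so a perfectly flat image does not divide by zero
    std = max(float(img.std()), 1.0 / np.sqrt(img.size))
    return (img - mean) / std

rng = np.random.default_rng(0)
example = rng.integers(0, 256, size=(8, 8, 3)).astype(np.float32)
standardized = per_image_standardize(example)
flat = per_image_standardize(np.full((8, 8, 3), 128.0))  # all zeros
```

Unlike subtracting the dataset mean, this removes each image's overall brightness and contrast, so that information is no longer available to the network.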
