This is a difficult problem!
However, the Computer Vision community has developed some pretty good solutions.
I'm not a Computer Vision expert, but some suggestions:
1) Any sort of local similarity/difference metric, e.g. norm(A(:)-B(:)), will be sensitive to noise/scale/rotation/illumination, etc.
So my first suggestion would be to use a similarity metric that is robust to those variations. A simple and effective one is the "Pyramid Match Kernel":
Grauman and Darrell (2005)
The pyramid match kernel: Discriminative classification with sets of image features
http://scholar.google.com/scholar?cluster=17504420260500492874&hl=en&as_sdt=0,44
This is technically a metric on point sets (or histograms thereof), but it can work pretty well on images directly if you take the image itself as the "level 0 histogram" (see ref above).
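To make that concrete, here is a rough MATLAB sketch of the idea (my own simplification, not code from the paper): treat each image as its level-0 histogram, repeatedly coarsen it, and accumulate the weighted histogram intersections. The function name, the 2x2 block coarsening, and the requirement that both images be the same size and nonnegative are all my assumptions.

function K = pyramid_match(A, B, nLevels)
% Rough sketch of the Pyramid Match Kernel (Grauman & Darrell 2005)
% applied directly to two grayscale images, treating each image as
% its "level 0 histogram".  My assumptions (not from the paper):
% A and B are the same size and nonnegative, and each level is built
% by summing 2x2 blocks (odd rows/columns are simply dropped).
A = double(A);  B = double(B);
K = 0;  prevMatches = 0;
for i = 0:nLevels
    if isempty(A), break; end
    % Histogram intersection at this resolution
    matches = sum(min(A(:), B(:)));
    % New matches found at this level get weight 1/2^i,
    % so matches at finer resolutions count for more
    K = K + (matches - prevMatches) / 2^i;
    prevMatches = matches;
    % Coarsen: crop to even size, then sum 2x2 blocks
    r = 2*floor(size(A,1)/2);  c = 2*floor(size(A,2)/2);
    A = A(1:r,1:c);  B = B(1:r,1:c);
    A = A(1:2:end,:) + A(2:2:end,:);  A = A(:,1:2:end) + A(:,2:2:end);
    B = B(1:2:end,:) + B(2:2:end,:);  B = B(:,1:2:end) + B(:,2:2:end);
end
end

As in the paper, you would normally normalize the raw score by the self-similarities, e.g. pyramid_match(A,B,n) / sqrt(pyramid_match(A,A,n)*pyramid_match(B,B,n)), so that scores are comparable across images.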
2) If that doesn't work by itself, you can use it in its original framework: as a distance metric between "feature vectors" in a high-dimensional but sparse "feature space" derived from the image. There is a large literature on feature detectors, and I haven't used them myself, but this looks like a good bet:
OpenSURF (including Image Warp)
by Dirk-Jan Kroon (July 2010): SURF (Speeded Up Robust Features) image feature-point detection and matching, in the same spirit as SIFT
http://www.mathworks.com/matlabcentral/fileexchange/28300-opensurf-including-image-warp
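Once you have descriptors (e.g. from OpenSURF), comparing two images mostly comes down to nearest-neighbour matching of the descriptors with a ratio test. Here is a rough MATLAB sketch; the d-by-n column layout of the descriptor matrices, the function name, and the 0.7 ratio threshold are my assumptions, so adapt it to whatever the OpenSURF download actually returns.

function matches = match_descriptors(D1, D2, ratio)
% Rough sketch: nearest-neighbour descriptor matching with a ratio
% test.  D1 (d-by-n1) and D2 (d-by-n2) hold one descriptor per
% column, e.g. 64-dimensional SURF descriptors.  The column layout
% and the default ratio of 0.7 are assumptions, not OpenSURF's API.
if nargin < 3, ratio = 0.7; end
matches = zeros(2, 0);
for i = 1:size(D1, 2)
    % Euclidean distance from descriptor i to every column of D2
    d = sqrt(sum(bsxfun(@minus, D2, D1(:, i)).^2, 1));
    [dSorted, idx] = sort(d);
    % Accept only if the best match clearly beats the runner-up
    if numel(dSorted) > 1 && dSorted(1) < ratio * dSorted(2)
        matches(:, end+1) = [i; idx(1)]; %#ok<AGROW>
    end
end
end

The number of accepted matches (or their total distance) can then serve as your image-to-image similarity score.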
The presentation here gives a brief overview of these approaches:
Scalable Image Recognition and Retrieval
http://www.krellinst.org/doecsgf/conf/2007/pres/grauman.shtml
Hope that helps!
--Matt Wolinsky