I'm writing a computer program to automatically detect black noisy borders on scanned images and crop them off.
My algorithm is based on 2 variables: gray mean value (of the pixels in a row/column) and position (of a row/column in the image).
Gray Mean Value
Images are in grayscale: this means that every pixel has a gray value in the range 0 (black) to 255 (white).
For each row/column of pixels, I compute the mean gray value of all the pixels in that row/column. If the result is dark, then the current row/column is part of the border to cut off.
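To make the mean computation concrete, here is a minimal sketch using NumPy. It assumes the image has already been loaded as a 2-D array of gray values; the function names and the `threshold` of 100 for "dark" are only illustrative assumptions, not values taken from my program.

```python
import numpy as np

def mean_profiles(gray):
    """Return (row_means, col_means) for a 2-D grayscale array (0 = black, 255 = white)."""
    gray = np.asarray(gray, dtype=np.float64)
    row_means = gray.mean(axis=1)  # one value per row, indexed from the top
    col_means = gray.mean(axis=0)  # one value per column, indexed from the left
    return row_means, col_means

def is_dark(profile, threshold=100.0):
    """Boolean mask of rows/columns whose mean is below the (assumed) threshold."""
    return np.asarray(profile) < threshold

# Tiny synthetic "scan": light paper with a dark top row and two dark left columns.
image = np.full((6, 8), 180.0)
image[0, :] = 10
image[:, :2] = 10
row_means, col_means = mean_profiles(image)
dark_rows = is_dark(row_means)
dark_cols = is_dark(col_means)
```

On this toy input only the first row and the first two columns are flagged as dark.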
Position
The position is the distance in pixels of a row/column from the top left corner of the image.
Take a look at the following images for a better idea.
Thumbnail of a scanned image:
Resulting chart:
By looking at the chart it is very easy to estimate where the cropping points are, thanks to the following rule: most of the samples fall in a narrow light range (150-200), which is the actual paper, and then in the tails there is a quick change to dark values.
Those quick changes are the cropping points.
(Notice also that at the very end of the tails there can still be white for a few pixels, but this seldom happens.)
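The rule above can be turned into a rough heuristic. This is only a sketch, not a proper statistical method: it assumes the median of the profile approximates the paper level, and it treats a sample as dark when it falls below a hypothetical `drop_fraction` of that median. It searches from the middle outward so that a few white pixels at the very end of a tail are ignored.

```python
import numpy as np

def crop_bounds(profile, drop_fraction=0.6):
    """Return (start, end) so that profile[start:end] is the presumed paper region.

    Assumptions: the median of the profile is close to the paper level, and
    a sample is "dark" when it is below drop_fraction * median.
    """
    profile = np.asarray(profile, dtype=np.float64)
    cutoff = drop_fraction * np.median(profile)
    dark = profile < cutoff
    mid = len(profile) // 2

    # Nearest dark sample to the left of the middle: crop just after it.
    left_dark = np.flatnonzero(dark[:mid])
    start = int(left_dark[-1]) + 1 if left_dark.size else 0

    # Nearest dark sample to the right of the middle: crop just before it.
    right_dark = np.flatnonzero(dark[mid:])
    end = mid + int(right_dark[0]) if right_dark.size else len(profile)
    return start, end

# Toy profile: white scanner lid at both ends, dark borders, paper in between.
profile = [200, 200, 20, 30, 170, 180, 175, 185, 180, 25, 15, 210]
bounds = crop_bounds(profile)
```

For the toy profile the bounds are (4, 9), i.e. the five paper samples. A dark band inside the page itself (for example a photo) would be mistaken for a border edge, so this is nowhere near robust yet.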
I want to do it automatically. Are there any statistics which can help me out?
PS: I'm a computer engineer, I've studied some statistics but… too many years ago!
In the best-case scenario the code should work with any scanned image affected by the black border problem but, realistically, I'll be satisfied if it works with these samples:
https://docs.google.com/folder/d/0B8ubCWBwsuOON3d1VVo4Z1AxWDA/edit
Best Answer
One risk with your current idea is how it will react to a page containing a dark photo or image. Have you thought about that? Would it get cropped?
Another corner case I can think of is slant-scanned pages. Is that a possibility? Then the borders won't be row/column aligned.
Have a look at this article:
http://www.dfki.uni-kl.de/~shafait/papers/Shafait-projection-based-cleanup-INMIC09.pdf
"A Simple and Effective Approach for Border Noise Removal from Document Images"
I think they've covered a lot of the same goals that you have.