Solved – Auto crop black borders from a scanned image by making stats about gray values

Tags: feature-engineering, image processing, signal detection, spatial

I'm writing a computer program to automatically detect black noisy borders on scanned images and crop them off.
My algorithm is based on two variables: the mean gray value (of the pixels in a row/column) and the position (of that row/column in the image).

Gray Mean Value

Images are in grayscale: every pixel has a gray value in the range from 0 (black) to 255 (white).
For each row/column of pixels, I compute the mean gray value of all the pixels in that row/column. If the result is dark, then that row/column is part of the border to cut off.
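As a minimal sketch of the per-row and per-column means described above, assuming the image is already loaded as a list of equal-length rows of integers in 0..255 (the names `row_means`, `column_means` and the tiny synthetic image are illustrative, not from the original program):

```python
from statistics import fmean

def row_means(image):
    """Mean gray value of each row; image is a list of equal-length rows."""
    return [fmean(row) for row in image]

def column_means(image):
    """Mean gray value of each column (zip(*image) transposes the rows)."""
    return [fmean(col) for col in zip(*image)]

# Synthetic 4x5 "scan": a dark top row and a dark left column around light paper.
image = [
    [10,  12,  11,   9,  10],
    [10, 180, 185, 190, 182],
    [12, 178, 181, 188, 184],
    [11, 182, 179, 186, 183],
]

rows = row_means(image)     # one value per row, indexed by vertical position
cols = column_means(image)  # one value per column, indexed by horizontal position
```

The index of each value in `rows` or `cols` is the position of that row/column, so the two lists together are the data behind a chart of mean gray value against position. On real scans an array library would probably be faster, but the arithmetic is the same.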

Position

The position is the distance in pixels of a row/column from the top left corner of the image.

Take a look at the following images for a better idea.
Thumbnail of a scanned image:

https://docs.google.com/file/d/0B8ubCWBwsuOOaVh6dVg4M19tajA/edit

Resulting chart:

https://docs.google.com/file/d/0B8ubCWBwsuOOYm1EaGpQb08tanc/edit

Looking at the chart, it is very easy to estimate where the cropping points are, because of the following rule: most of the samples fall in a narrow light range (150-200), which is the actual paper, and then in the tails there is a quick change to dark values.

Those quick changes are the cropping points.
(Notice also that at the very ends of the tails there can still be white for a few pixels, but this seldom happens.)
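One possible way to turn that rule into code, offered only as a sketch: take the median of the profile as the paper level (reasonable if most samples really are paper), call anything above some fraction of it "paper-like", and keep the longest contiguous paper-like run. The function name `crop_bounds` and the `fraction=0.6` default are assumptions for illustration and would need tuning on real scans.

```python
from statistics import median

def crop_bounds(profile, fraction=0.6):
    """Return (start, end) of the longest paper-like run, end exclusive.

    Assumption: most samples are paper, so the median estimates the paper level.
    `fraction` is an illustrative tuning knob, not a value from the question.
    """
    threshold = fraction * median(profile)
    best = (0, 0)
    run_start = None
    # A sentinel below any threshold closes a run that reaches the last sample.
    for i, value in enumerate(list(profile) + [float("-inf")]):
        if value >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best

# Light slivers at both extremes, dark borders inside them, paper in the middle.
profile = [200, 190, 20, 15, 30, 170, 180, 175, 185, 160, 178, 25, 10, 195]
bounds = crop_bounds(profile)
print(bounds)  # (5, 11)
```

Keeping the longest run, rather than stopping at the first light sample from each edge, is what lets the short white slivers at the very ends of the tails be discarded together with the dark border.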

I want to do it automatically. Are there any statistics which can help me out?
PS: I'm a computer engineer, I've studied some statistics but… too many years ago!

In the best-case scenario the code should work with any scanned image affected by the black border problem but, realistically, I'll be satisfied if it works with these samples:
https://docs.google.com/folder/d/0B8ubCWBwsuOON3d1VVo4Z1AxWDA/edit

Best Answer

One risk of your current idea may be how it'll react to a page with a dark photo or an image. Have you thought about that? Would it get cropped?
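One hedge against that risk, sketched here as an assumption of mine rather than anything from the question or the article below, is to look for border samples only inside an outer band of the profile, so a dark region in the middle of the page can never move the crop points. The name `crop_within_margins`, the `threshold` value and the 25% band width are all illustrative.

```python
def crop_within_margins(profile, threshold, max_fraction=0.25):
    """Return (start, end), end exclusive, searching only the outer bands.

    Only the first and last `max_fraction` of the samples may be treated as
    border; dark samples further inside are left alone.
    """
    n = len(profile)
    limit = int(n * max_fraction)
    start = 0
    for i in range(limit):
        if profile[i] < threshold:
            start = i + 1  # crop just past the innermost dark sample in the leading band
    end = n
    for i in range(n - 1, n - 1 - limit, -1):
        if profile[i] < threshold:
            end = i        # crop just before the innermost dark sample in the trailing band
    return start, end

# Dark borders near both ends, plus a dark "photo" in the middle of the page.
profile = [200, 20, 15] + [180] * 5 + [30] * 4 + [180] * 5 + [25, 10, 190]
result = crop_within_margins(profile, threshold=100)
print(result)  # (3, 17)
```

The dark run in the middle of this profile is ignored because it lies outside both bands. The trade-off is that a border wider than the band would only be partly removed.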

Another corner case I can think of is slant-scanned pages. Is that a possibility? If so, the borders won't be row/column aligned.

Have a look at this article:

http://www.dfki.uni-kl.de/~shafait/papers/Shafait-projection-based-cleanup-INMIC09.pdf

"A Simple and Effective Approach for Border Noise Removal from Document Images"

I think they've covered a lot of the same goals that you have.
