I'm considering using a fully connected (ReLU) or convolutional deep learning network to classify black-and-white 8.5"x11" images (with some fine details). Most examples of DNNs I have seen were tested on the MNIST images, which are 28×28 pixels. I figured I could probably reduce my images to 320×414 pixels and they would still be recognizable for my classification needs; further reduction may be risky, as even a human may have a hard time telling the details apart. But even at this resolution there will be 132,480 pixels, so the network input would be a vector of 32-bit floats with that many elements. Will a ReLU or convolutional network handle such large inputs? What are the methods to reduce the input size?
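To make the numbers above concrete, here is a quick sanity check of the flattened input size and per-image memory footprint at the proposed resolution:

```python
# Sanity-check the input size described above: an 8.5"x11" page
# downscaled to 320x414 grayscale pixels, stored as 32-bit floats.
width, height = 320, 414
n_pixels = width * height
bytes_per_image = n_pixels * 4  # float32 = 4 bytes each

print(n_pixels)         # 132480 input elements
print(bytes_per_image)  # 529920 bytes, about 0.5 MB per image
```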
Solved – Can a neural network classify large images?
deep learning, neural networks
Related Solutions
The short answer to this is that if you train a neural network to recognize MNIST images, then it will work well for precisely that (i.e., recognize more MNIST images).
Now, it seems a little disappointing that we cannot use our awesome neural network, which probably has an accuracy near 95% or even 99%, to recognize new handwritten digits. Who wants to be able to recognize only images in the MNIST dataset?
So, after researching a little bit more, the general consensus seems to be that if you want your neural network to be able to recognize your own handwritten digits, you have two options:
1) include enough of your own digits in the training set, along with the MNIST ones.
or
2) pre-process your own handwritten digits so they resemble MNIST digits closely enough, thus fitting what the neural network has been trained for.
We could even generalize this not only for MNIST but for any domain. Either you include enough variations in your training set (enrich your training set), or you choose one data representation and pre-process any other variation to match the one you chose (pre-process your input).
I found this article, which is precisely about how to do pre-processing to match the MNIST data representation.
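A minimal sketch of the usual recipe (not MNIST's exact pipeline, and not necessarily what the linked article does): crop to the ink, scale the digit to fit a 20×20 box, pad to 28×28, and shift so the center of mass sits at the image center.

```python
import numpy as np

def preprocess_to_mnist_style(img):
    """Roughly normalize a digit image toward the MNIST representation.
    `img` is a 2D float array with ink > 0 on a 0 background.
    This is a sketch of the commonly described recipe, not an exact one."""
    # Crop to the bounding box of the ink.
    ys, xs = np.nonzero(img)
    cropped = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # Nearest-neighbor resize so the longer side becomes 20 pixels.
    h, w = cropped.shape
    scale = 20.0 / max(h, w)
    nh = max(1, int(round(h * scale)))
    nw = max(1, int(round(w * scale)))
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = cropped[np.ix_(rows, cols)]

    # Pad to 28x28 with the digit roughly centered.
    out = np.zeros((28, 28), dtype=np.float32)
    top, left = (28 - nh) // 2, (28 - nw) // 2
    out[top:top + nh, left:left + nw] = resized

    # Shift so the center of mass lands at the center, as MNIST does.
    total = out.sum()
    cy = (out.sum(axis=1) @ np.arange(28)) / total
    cx = (out.sum(axis=0) @ np.arange(28)) / total
    out = np.roll(out, (int(round(14 - cy)), int(round(14 - cx))),
                  axis=(0, 1))
    return out
```

With this kind of normalization, your own digits land in (approximately) the same representation the network was trained on.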
I personally cannot help but feel that pre-processing is kind of "cheating", because the whole point of neural networks is not having to resort to this "algorithmic" type of solution. It would be nicer to be able to train a neural network that generalizes well enough that any data representation (within the limits of a particular domain, like handwritten digits) will work. But these seem to be the current state-of-the-art approaches, and for some domains you'll have to go one way or the other to get good results.
EDIT: as pointed out by a StackOverflow user in the comments section of this question, this can be seen as a "Domain Adaptation" problem. I'm currently learning about Domain Adaptation techniques, in particular "Domain-Adversarial training of neural networks" (google this for several papers on the subject). In particular I'm looking at this GitHub repo that implements a Domain-Adversarial example using Tensorflow. I added this so others can be pointed in the same direction.
Best Answer
There have been convolutional networks for videos of $224 \times 224 \times 10$ (1), so yes, it's possible.
I would strongly suggest reducing the image size as much as possible, and at the same time using non-fully-connected (convolutional) layers at the beginning, reducing the dimensionality of your optimisation problem.
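A quick back-of-the-envelope comparison shows why convolutional front layers keep the problem tractable; the layer sizes below (1024 hidden units, a 5×5 kernel with 32 channels) are assumed for illustration, not taken from the question:

```python
# Parameter count: fully connected vs. convolutional first layer
# for a 320x414 grayscale input.
h, w = 414, 320
n_inputs = h * w  # 132480 pixels

# Fully connected: every input pixel connects to every hidden unit.
hidden_units = 1024
fc_params = n_inputs * hidden_units + hidden_units  # weights + biases

# Convolutional: one small kernel shared across all spatial positions.
kernel, channels = 5, 32
conv_params = kernel * kernel * 1 * channels + channels

print(fc_params)    # 135660544 parameters
print(conv_params)  # 832 parameters
```

The fully connected layer needs over 135 million parameters before you even reach a second layer, while the convolutional one needs under a thousand, which is the dimensionality reduction the suggestion above is about.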
Another approach you could try is to use a sliding window as input instead of the whole image. This way you could reuse the features of the first layers of any pretrained ImageNet network, which would significantly decrease your training time. In case you are using Torch7, you can find them here (2).
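The sliding-window idea can be sketched as follows; the 64×64 window and stride of 32 are arbitrary choices for illustration, not values from the answer:

```python
import numpy as np

def sliding_windows(img, window=64, stride=32):
    """Yield (row, col, patch) for each window position over a 2D image."""
    h, w = img.shape
    for r in range(0, h - window + 1, stride):
        for c in range(0, w - window + 1, stride):
            yield r, c, img[r:r + window, c:c + window]

# Each patch, rather than the full 320x414 image, becomes a network input.
img = np.random.rand(414, 320).astype(np.float32)
patches = list(sliding_windows(img))
print(len(patches))  # 99 patches of 64x64
```

Each patch is then fed through the (pretrained) feature extractor, and the per-patch outputs are aggregated into a classification for the whole page.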
In both cases, in order to train such convolutional nets you will need a lot of computational power and one or more very good GPUs.