I'm not sure if this exactly answers your question, but as I understand it, the reason you don't see people pretraining conv nets (in the unsupervised-pretraining sense) is that various innovations in purely supervised training have rendered unsupervised pretraining unnecessary, at least for now (who knows what problems and issues the future will hold?).
One of the main innovations was moving away from sigmoidal activation units (sigmoid, tanh), which can saturate: in the near-flat regions of these curves very little gradient is propagated backwards, so learning is incredibly slow, if not completely halted for all practical purposes. The Glorot, Bordes and Bengio article Deep Sparse Rectifier Neural Networks used rectified linear units (ReLUs) as activation functions in lieu of the traditional sigmoidal units. A ReLU has the form $f(x) = \max(0, x)$; notice that it is unbounded and has constant gradient 1 on its positive part.
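To see the saturation point concretely, here is a small NumPy sketch (my own illustration, not code from the paper) comparing the sigmoid's gradient with the ReLU's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative s(x) * (1 - s(x)): peaks at 0.25 and vanishes for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Constant gradient of 1 on the positive side, 0 elsewhere
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 1.0, 10.0])
print(sigmoid_grad(x))  # near 0 at the tails: the saturation problem
print(relu_grad(x))     # [0. 0. 1. 1.]
```

At $x = \pm 10$ the sigmoid's gradient is on the order of $10^{-5}$, so almost nothing flows backwards through a saturated unit, while the ReLU passes the gradient through unchanged whenever the unit is active.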
The Glorot, Bordes and Bengio article used ReLUs for multilayer perceptrons, not conv nets. An earlier article, What is the best Multi-Stage Architecture for Object Recognition by Jarrett and others from Yann LeCun's NYU group, applied a rectifying nonlinearity to sigmoidal units instead, giving activation functions of the form $f(x) = |\tanh(x)|$, etc. Both articles observed that using rectifying nonlinearities seems to close much of the gap between purely supervised methods and unsupervised pretrained methods.
Another innovation is that we have figured out much better initializations for deep networks. Using the idea of standardizing variance across the layers of a network, good rules of thumb have been established over the years. One of the first and most popular was by Glorot and Bengio, Understanding the Difficulty of Training Deep Feedforward Networks, which provided a way to initialize deep nets under a linear-activation assumption; later, Delving Deep Into Rectifiers, by a team at Microsoft Research, modified the Glorot and Bengio initialization to account for rectifying nonlinearities. Weight initialization is a big deal for extremely deep nets: for a 30 layer conv net, the MSR weight initialization performed much better than the Glorot weight initialization. Keep in mind that the Glorot paper came out in 2010 and the MSR paper came out in 2015.
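A minimal sketch of the two rules (my own NumPy rendering of the published formulas, with a fixed seed for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    # Glorot & Bengio (2010): keep activation/gradient variance stable
    # under a linear-activation assumption; Var(W) = 2 / (fan_in + fan_out)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He et al. (2015): double the variance to compensate for the ReLU
    # zeroing out half the activations; Var(W) = 2 / fan_in
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W = he_normal(512, 512)
print(W.std())  # close to sqrt(2/512) ≈ 0.0625
```

Both rules only differ in the target variance; the MSR version's extra factor of 2 is precisely the correction for a ReLU keeping, on average, half of its inputs.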
I am not sure if the ImageNet Classification with Deep Convolutional Neural Networks paper by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton was the first to use ReLUs for conv nets, but it had the biggest impact. In this paper we see that ReLUs speed up learning for conv nets, as evidenced by one of their CIFAR-10 graphs showing that ReLU conv nets reach low training error rates faster than conv nets with saturating units. Because ReLUs do not suffer from the vanishing-gradient/saturation issues of sigmoidal units, they can be used to train much deeper nets. Another big innovation has been Dropout training, a stochastic noise injection or model averaging technique (depending on your point of view) which allows us to train deeper, bigger neural networks for longer without overfitting as much.
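For reference, a bare-bones sketch of the "inverted" dropout variant that most libraries implement (an illustration under that assumption, not the exact formulation of the original paper, which rescaled at test time instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, train=True):
    # Inverted dropout: zero each unit with probability p_drop during
    # training and rescale the survivors by 1/(1 - p_drop), so the
    # expected activation is unchanged and test time needs no scaling.
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones((4, 8))
print(dropout(h, 0.5))               # roughly half zeros, the rest 2.0
print(dropout(h, 0.5, train=False))  # unchanged at test time
```

The "model averaging" view comes from the fact that each training step effectively samples a different thinned sub-network, and test time approximates averaging over all of them.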
And conv net innovation has continued at a blistering pace, with nearly all of the top methods using ReLUs (or modifications like the PReLU from Microsoft Research), Dropout, and purely supervised training (SGD + momentum, possibly with adaptive learning rate techniques like RMSProp or AdaGrad).
So as of now, a lot of the top performing conv nets seem to be of a purely supervised nature. That's not to say that unsupervised pretraining or other unsupervised techniques won't be important in the future. But some incredibly deep conv nets, trained with supervised learning alone, have matched or surpassed human-level performance on very rich datasets. In fact I believe the latest Microsoft Research submission to the ImageNet 2015 contest had 150 layers. That is not a typo. 150.
If you want to use unsupervised pretraining for conv nets, I think you would be best served by finding a task where "standard" supervised training of conv nets doesn't perform so well and trying unsupervised pretraining there.
Unlike natural language modeling, it seems to be hard to find an unsupervised task that helps a corresponding supervised task when it comes to image data. But if you look around the Internet enough, you see some of the pioneers of deep learning (Yoshua Bengio, Yann LeCun to name a few) talk about how important they think unsupervised learning is and will be.
The answer of @ik_vision describes how to estimate the memory space needed for storing the weights, but you also need to store the intermediate activations, and especially for convolutional networks working with 3D data, this is the main part of the memory needed.
To analyze your example:
- Input needs 1000 elements
- After each of layers 1-4 you have 100 elements, 400 in total
- After the final layer you have 10 elements
In total for 1 sample you need 1410 elements for the forward pass. For the backward pass you also need gradient information for each of these elements except the input, which is 410 more, totaling 1820 elements per sample. Multiply by the batch size (256 here) to get 465 920.
I said "elements", because the size required per element depends on the data type used. For single precision (float32) it is 4 B per element, so the total memory needed to store the data blobs will be around 1.8 MB.
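The whole calculation above as a short script (the layer sizes and the batch size of 256 are the ones implied by the numbers in the example):

```python
# Rough activation-memory estimate for the example above:
# input of 1000 elements, four hidden layers of 100, output of 10.
layer_sizes = [1000, 100, 100, 100, 100, 10]

activations = sum(layer_sizes)        # 1410 elements per sample
gradients = sum(layer_sizes[1:])      # 410: no gradient stored for the input
per_sample = activations + gradients  # 1820

batch_size = 256
elements = per_sample * batch_size    # 465 920
bytes_float32 = elements * 4          # 4 bytes per float32 element

print(elements)                # 465920
print(bytes_float32 / 2**20)   # ≈ 1.78 MiB
```

For a real conv net you would replace each entry of `layer_sizes` with height × width × channels of that layer's output, which is why spatially large early layers usually dominate the memory budget.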
There are a number of reasons training might be taking a long time, but the first that comes to mind is an inappropriate learning rate. If your learning rate is too high, instead of descending towards a minimum, your gradient path will bounce around erratically, and this can continue indefinitely. If your learning rate is too low (I've seen this happen less frequently in practice), your model will take a very long time to reach the minimum, but it will eventually arrive. I'm not familiar with the particular libraries you've referenced, so unfortunately I can't offer library-specific details.
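You can see both failure modes on a toy problem. The sketch below (my own illustration) runs plain gradient descent on $f(x) = x^2$, whose gradient is $2x$, with three fixed learning rates:

```python
def gradient_descent(lr, steps=50, x0=5.0):
    # Minimize f(x) = x^2 (gradient 2x) from x0 with a fixed learning rate.
    x = x0
    for _ in range(steps):
        x = x - lr * 2.0 * x
    return x

print(gradient_descent(0.1))    # well-chosen: ends very close to the minimum at 0
print(gradient_descent(0.001))  # too low: barely moved after 50 steps
print(gradient_descent(1.1))    # too high: each step overshoots and |x| blows up
```

Each step multiplies $x$ by $(1 - 2\,\mathrm{lr})$, so the iterates converge only when that factor has magnitude below 1; with $\mathrm{lr} = 1.1$ the factor is $-1.2$ and the path diverges while flipping sign, which is exactly the "bouncing around" behavior described above.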
There is a pretty good general answer (with references) on Stack Overflow about setting a good learning rate in neural networks. See the link below.
https://stackoverflow.com/questions/11414374/neural-network-learning-rate-and-batch-weight-update