Solved – What does the input matrix for a deep learning architecture look like?

deep learning

I am somewhat new to deep learning and trying to understand the different architectures. There are a couple things that confuse me greatly with regards to setting up the input:

i) I believe the input data is usually vectorized or flattened; how then is the local 2D structure (in the case of an image) or 3D structure (for video data) accounted for?

ii) How does one arrange the input matrix when both patches within a single image and the entire training set are used?
For instance, say we use 100 images from the MNIST dataset as a training set: 10 classes, with 10 images per class. Each image is 28 x 28 pixels.

Is the input data set then organized as:

— Vectorize each image: 28 x 28 = 784 entries.
— So for 100 images, do we have a 784 x 100 matrix?
— What if instead we train on several 10 x 10 patches randomly sampled from each image?
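The two layouts above can be sketched with NumPy. This is a minimal illustration, assuming random data in place of actual MNIST images; only the shapes matter here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 100 MNIST images (10 classes x 10 images), 28 x 28 pixels each.
# (Random data for illustration; a real run would load MNIST instead.)
images = rng.random((100, 28, 28))

# Layout 1: vectorize each image into a 784-entry column -> 784 x 100 matrix.
X = images.reshape(100, 28 * 28).T
print(X.shape)            # (784, 100)

# Layout 2: sample one random 10 x 10 patch per image.
patch = 10
rows = rng.integers(0, 28 - patch + 1, size=100)
cols = rng.integers(0, 28 - patch + 1, size=100)
patches = np.stack([img[r:r + patch, c:c + patch]
                    for img, r, c in zip(images, rows, cols)])
print(patches.shape)      # (100, 10, 10)
```

Whether columns or rows hold the samples (784 x 100 vs. 100 x 784) is just a convention; libraries differ, so check which one yours expects.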

I understand there are different architectures. Explaining the input for any one of them will help me greatly.

I am reading this paper for instance: http://ai.stanford.edu/~wzou/cvpr_LeZouYeungNg11.pdf
I don't understand the inputs and therefore the convolution fully.

Best Answer

  1. Yes, flattening "hides" local structure, so you generally don't want to do it. Since convolutional methods, for example, depend on that structure, flattening is not used with them. Keeping the structure gives you a 3rd-order tensor (of shape $N \times w \times h$) instead of a matrix (of shape $N \times wh$). You can think of tensors as general multidimensional arrays: a matrix is a 2-dimensional array (and a special case of a tensor), and an $n$-dimensional array is a tensor of order $n$.

    Tensors turn out to be so convenient for operating on such data that theano makes heavy use of them.
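A small NumPy sketch of the tensor-vs-matrix distinction: reshaping between the two views is lossless, but the flattened matrix no longer exposes which entries are spatial neighbors (the sizes below are arbitrary):

```python
import numpy as np

N, w, h = 100, 28, 28
tensor = np.arange(N * w * h, dtype=np.float32).reshape(N, w, h)  # 3rd-order tensor

matrix = tensor.reshape(N, w * h)    # flattened: N x wh matrix
restored = matrix.reshape(N, w, h)   # the reshape is reversible...
assert (restored == tensor).all()    # ...but a row of `matrix` gives no hint of
                                     # which of its 784 entries are adjacent pixels
```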

  2. A nice property of tensors is that they can have arbitrary dimension. For example, think of working with images. You have $N$ samples, each with $c$ channels (e.g., R, G, B, though other representations may be preferable), each $w \times h$ pixels in size. You can represent such a dataset as a 4th-order tensor of shape $N \times c \times w \times h$.

    This is even more useful if you're familiar with the details of ConvNets and the notion of feature maps, since an intermediate result is then a 4th-order tensor of shape $N \times F \times w \times h$, where $F$ is the number of feature maps.
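To make those shapes concrete, here is a naive "valid" convolution in NumPy (purely illustrative and far too slow for real use; the batch size, channel count, filter count $F = 4$, and filter size are arbitrary choices). It maps a $N \times c \times w \times h$ input to a $N \times F \times (w-k+1) \times (h-k+1)$ stack of feature maps:

```python
import numpy as np

def conv_valid(batch, filters):
    """Naive 'valid' convolution.
    batch:   (N, c, w, h) input tensor
    filters: (F, c, k, k) filter bank
    returns: (N, F, w-k+1, h-k+1) feature maps
    """
    N, c, w, h = batch.shape
    F, _, k, _ = filters.shape
    out = np.zeros((N, F, w - k + 1, h - k + 1))
    for n in range(N):                     # each sample
        for f in range(F):                 # each filter
            for i in range(w - k + 1):     # each valid position
                for j in range(h - k + 1):
                    # correlate the filter with a k x k window across all channels
                    out[n, f, i, j] = np.sum(
                        batch[n, :, i:i + k, j:j + k] * filters[f])
    return out

rng = np.random.default_rng(0)
batch = rng.random((2, 3, 8, 8))     # N=2 samples, c=3 channels, 8 x 8 pixels
filters = rng.random((4, 3, 3, 3))   # F=4 filters of size 3 x 3
maps = conv_valid(batch, filters)
print(maps.shape)                    # (2, 4, 6, 6)
```

Each of the $F$ filters spans all $c$ input channels, which is why the channel axis disappears from the output and is replaced by a feature-map axis.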
