Solved – mini-batch matrix form of a neural network

deep learning, machine learning, neural networks

I am studying neural networks and have read a lot about mini-batch gradient descent, which I am now implementing in TensorFlow.

But I have some questions about the matrix formulation of the problem, i.e. how to vectorize it.

Notation

First, some notation:
$w_{ij}^{(l)}$ is the weight on the connection from node $j$ in layer $l$ to node $i$ in layer $l+1$.

If the next layer has $4$ nodes and the input has $3$ features, we could have:

$W^{(1)} = \begin{bmatrix}
w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\
w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \\
w_{31}^{(1)} & w_{32}^{(1)} & w_{33}^{(1)} \\
w_{41}^{(1)} & w_{42}^{(1)} & w_{43}^{(1)}
\end{bmatrix}$

At the same time, suppose the batch size is $5$; then the input matrix might be

$X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23} \\
x_{31} & x_{32} & x_{33} \\
x_{41} & x_{42} & x_{43} \\
x_{51} & x_{52} & x_{53}
\end{bmatrix}$

where $x_{ij}$ is the value of feature $j$ in sample $i$.

What I don't understand

Everywhere I look I see $f(W^TX + \mathbf{b})$ as the formula. However, this clearly doesn't work here, because the shapes are incompatible: $W$ has shape $(\text{number of nodes}, \text{number of features})$, while $X$ has shape $(\text{batch size}, \text{number of features})$.

Ideally, for each sample, we would like to end up with something like this (I only write out the first two columns). But how do we get there? Is this the correct way?

$A^{(1)} = \begin{bmatrix}
w_{11}^{(1)}x_{11} + w_{12}^{(1)}x_{12} + w_{13}^{(1)}x_{13} & w_{11}^{(1)}x_{21} + w_{12}^{(1)}x_{22} + w_{13}^{(1)}x_{23}\\
w_{21}^{(1)}x_{11} + w_{22}^{(1)}x_{12} + w_{23}^{(1)}x_{13} & w_{21}^{(1)}x_{21} + w_{22}^{(1)}x_{22} + w_{23}^{(1)}x_{23}\\
w_{31}^{(1)}x_{11} + w_{32}^{(1)}x_{12} + w_{33}^{(1)}x_{13} & w_{31}^{(1)}x_{21} + w_{32}^{(1)}x_{22} + w_{33}^{(1)}x_{23}\\
w_{41}^{(1)}x_{11} + w_{42}^{(1)}x_{12} + w_{43}^{(1)}x_{13} & w_{41}^{(1)}x_{21} + w_{42}^{(1)}x_{22} + w_{43}^{(1)}x_{23}
\end{bmatrix}$
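For concreteness, here is a minimal NumPy sketch (not my actual TensorFlow code) of the two orientations I can imagine for the shapes above; all names are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # (number of nodes, number of features)
X = rng.standard_normal((5, 3))   # (batch size, number of features)
b = rng.standard_normal(4)        # one bias per node

# Samples as rows: result has shape (batch size, number of nodes)
A_rows = X @ W.T + b              # bias broadcasts across the batch

# Samples as columns, as in the matrix A^(1) written above:
# result has shape (number of nodes, batch size)
A_cols = W @ X.T + b[:, None]

# The two are just transposes of each other
assert np.allclose(A_rows.T, A_cols)
```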

Best Answer

I don't see the problem. Here's a neural net with three layers: $$ \begin{array}{c} y = \beta_0 + \mathbf{V}_1\beta \\ \mathbf{V}_1 = a(\gamma_1 + \mathbf{V}_2\Gamma_1)\\ \mathbf{V}_2 = a(\gamma_2 + \mathbf{V}_3\Gamma_2)\\ \mathbf{V}_3 = a(\gamma_3 + \mathbf{X}\Gamma_3) \end{array} $$

where $\mathbf{X}$ is your data.

Say that $\mathbf{X}$ is $N \times p$. Then $\Gamma_3$ is $p \times p_3$, so $\mathbf{V}_3$ is $N \times p_3$; $\Gamma_2$ is $p_3 \times p_2$, so $\mathbf{V}_2$ is $N \times p_2$; $\Gamma_1$ is $p_2 \times p_1$, so $\mathbf{V}_1$ is $N \times p_1$; and $\beta$ is $p_1 \times 1$ (assuming a univariate response).
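As a quick sanity check of those shapes, here is a minimal NumPy sketch; the dimensions $p = 3$, $p_3 = 4$, $p_2 = 4$, $p_1 = 2$ and the ReLU activation are arbitrary choices for illustration:

```python
import numpy as np

def a(z):
    return np.maximum(z, 0.0)      # activation; ReLU picked arbitrarily

N, p, p3, p2, p1 = 100, 3, 4, 4, 2
rng = np.random.default_rng(0)

X      = rng.standard_normal((N, p))
Gamma3 = rng.standard_normal((p,  p3)); gamma3 = rng.standard_normal(p3)
Gamma2 = rng.standard_normal((p3, p2)); gamma2 = rng.standard_normal(p2)
Gamma1 = rng.standard_normal((p2, p1)); gamma1 = rng.standard_normal(p1)
beta   = rng.standard_normal((p1, 1)); beta0  = 0.5

V3 = a(gamma3 + X  @ Gamma3)   # (N, p3)
V2 = a(gamma2 + V3 @ Gamma2)   # (N, p2)
V1 = a(gamma1 + V2 @ Gamma1)   # (N, p1)
y  = beta0 + V1 @ beta         # (N, 1)
```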

For a minibatch, just swap out $N$ for $B < N$. It doesn't change the dimensions of the weights, only those of the derived variables/hidden layers $\mathbf{V}$.
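A minimal sketch of that point, again with made-up dimensions: only the leading dimension of the data and the hidden layers shrinks from $N$ to $B$, while the weight matrices are untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, p, p3 = 100, 16, 3, 4               # B < N is the minibatch size

X      = rng.standard_normal((N, p))
Gamma3 = rng.standard_normal((p, p3))     # same weights either way
gamma3 = rng.standard_normal(p3)

# Full data:  (N, p) @ (p, p3) -> (N, p3)
V3_full  = np.maximum(gamma3 + X @ Gamma3, 0.0)

# Minibatch: only the leading dimension shrinks; the weights are unchanged.
idx      = rng.choice(N, size=B, replace=False)
V3_batch = np.maximum(gamma3 + X[idx] @ Gamma3, 0.0)   # (B, p3)
```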

Your problem, I think, is that you were treating $X$ as row-oriented while the formula assumed it was column-oriented (or vice versa); that's why your matrices weren't conformable. But minibatching has nothing to do with that -- a minibatch drops observations, not features. Dropout, on the other hand, drops features, and in so doing it propagates the deleted columns upward/downward on the forward and backward passes.
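A small sketch of that contrast (simplified: real dropout samples a fresh mask per example and per layer; here I just mask input columns once, to match the column-deletion picture above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))

# Minibatch: keep every feature, drop observations (rows).
minibatch = X[rng.choice(100, size=16, replace=False)]    # shape (16, 3)

# Dropout at the input layer (simplified): keep every observation,
# zero out a random subset of feature columns, rescaling the survivors.
keep_prob = 0.8
mask = rng.random(X.shape[1]) < keep_prob                 # per-feature keep mask
dropped = X * mask / keep_prob                            # shape (100, 3), some columns zeroed
```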