Solved – If softmax is used as the activation function for the output layer, must the number of nodes in the last hidden layer equal the number of output nodes

neural networks

Let us assume that I have the following neural network architecture:

Input-Layer: 12 nodes
1st Hidden Layer: 9 nodes
2nd Hidden Layer: 6 nodes
Output Layer: 3 nodes

Can I use the softmax activation function on the output layer of the above architecture? If so, how? In the softmax formula, how will I get the numerator properly if the number of nodes in the last hidden layer and in the output layer are not the same?

In the architecture above, I will get 6 weighted inputs $(x_i)$ at the output layer. Then the softmax output for the $j$th output node is $$\frac{e^{x_j}}{S}, \quad \text{where } S=\sum_i e^{x_i}.$$

But this will only work if the ranges of $i$ and $j$ are the same, which is only possible if the last hidden layer and the output layer have the same number of nodes. Have I understood this correctly?

Best Answer

Have you done multi-class logistic regression? Because the same "problem" arises. Logistic regression is a two-layer (no hidden layer) neural network. It can have as many inputs as you want, regardless of the number of classes of the target variable. So I will explain the answer in the context of multi-class logistic regression, and all you need to do to translate this to your problem is replace "input features" with "last hidden layer" in your network.

In logistic regression you have $D$ input features, $X_0, X_1, \dots, X_{D-1}.$ For convenience, I'm including $X_0$ as the "bias term", so $X_0=1$ all the time. I'm just doing this so I don't have to describe the bias separately. (Bias terms appear in hidden layers of NNs, so the same thing would apply.)

Suppose the target variable, $t,$ is a categorical variable with $k$ classes. Then we use a one-of-k encoding for $t.$ (That is, we represent it with $k$ binary variables, where one of the variable values is $1$ and the rest are $0.$) Then the non-activated output is,

$$ A = W^T X, $$ where $W$ is a $D \times k$ matrix representing the weight vectors for each class, and $X$ is a $D \times 1$ vector representing a single observation. Thus $A$ is a $k \times 1$ vector. The output, $Y,$ which represents our predicted probabilities is, $$ Y = \text{softmax}(A), $$ meaning $$ Y_i = \frac{e^{A_i}}{\sum\limits_{j=1}^{k} e^{A_j}}. $$ $Y_i$ is the probability that the $i$th binary variable of $t$ is one.
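For concreteness, here is a minimal NumPy sketch of that forward pass. The values $D=5$ and $k=3$ and the random weights are purely illustrative (they are not from the question); the variable names mirror the notation above:

```python
import numpy as np

# Illustrative sizes, not values from the question.
D, k = 5, 3                      # D input features (including the bias X_0), k classes

rng = np.random.default_rng(0)
W = rng.normal(size=(D, k))      # D x k weight matrix: one column of weights per class
X = np.concatenate(([1.0], rng.normal(size=D - 1)))  # D x 1 observation, with X_0 = 1 as the bias

A = W.T @ X                      # non-activated output A = W^T X, shape (k,)

def softmax(a):
    a = a - a.max()              # shift for numerical stability; does not change the result
    e = np.exp(a)
    return e / e.sum()

Y = softmax(A)                   # predicted probabilities, one per class
print(Y, Y.sum())                # k probabilities that sum to 1
```

Note that the softmax sum runs over the $k$ entries of $A$, so the number of input features $D$ never appears in the denominator.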

Notice that the dimension of $X$ is irrelevant here, because $W^T$ converts a $D \times 1$ vector $X$ into a $k \times 1$ vector $A.$ Thus there is no hard constraint on the number of input features you can apply to multi-class logistic regression. This is exactly analogous to the last hidden layer in a multi-class classification neural network.
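To tie this back to the architecture in the question, here is a hypothetical sketch (random, untrained weights, chosen only for illustration) in which the 6 activations of the last hidden layer play the role of $X$ and a $6 \times 3$ weight matrix produces the 3 pre-softmax outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(size=6)           # activations of the 2nd hidden layer (6 nodes)
W_out = rng.normal(size=(6, 3))  # weights from the 6 hidden nodes to the 3 output nodes
b_out = np.zeros(3)              # output-layer biases

a = W_out.T @ h + b_out          # 3 pre-softmax values, one per output node
y = np.exp(a - a.max())
y = y / y.sum()                  # softmax over the 3 output nodes only

print(y.shape, y.sum())          # (3,) and 1.0 -- the hidden-layer size never enters the sum
```

The softmax denominator sums over the 3 output nodes, not the 6 hidden nodes, so the two layer sizes do not need to match.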
