What the outputs of your neurons represent depends on the objective function you use and on the activation function of the output neurons. For example, if you use the sum-of-squared-errors loss (regression), then one can show that the output of the network is the conditional average of the target data, conditioned on the input. In equations,
$$y_{k}\left(\mathbf{x},\mathbf{w}\right) = \int t_{k} p(t_{k}|\mathbf{x})dt_{k}$$
where $k$ indexes the output neurons, $\mathbf{x}$ is the input vector, $\mathbf{t}$ is the target vector, and $y(\mathbf{x},\mathbf{w})$ is the mapping carried out by the network.
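As a brief sketch of why this holds (the standard infinite-data argument; the exact presentation varies by textbook): in the limit of infinitely many samples the sum-of-squares error becomes

$$E = \frac{1}{2}\sum_{k}\iint \left\{ y_{k}\left(\mathbf{x},\mathbf{w}\right) - t_{k} \right\}^{2} p(t_{k}|\mathbf{x})\, p(\mathbf{x})\, dt_{k}\, d\mathbf{x}$$

and setting the variation of $E$ with respect to $y_{k}\left(\mathbf{x},\mathbf{w}\right)$ to zero at each $\mathbf{x}$ yields exactly the conditional average above.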
If you use the cross-entropy error function with sigmoidal output units (classification), then the output of each neuron is the probability that the sample belongs to the class encoded by that neuron. A brief discussion and derivation of this result can be found here.
Try to get a copy of the book for a detailed description. It's a great book and you will learn a lot.
That said, how you transform your outputs (if it makes sense to do so) depends on what you are doing and on how those outputs are to be interpreted, which you don't explain in your question.
I believe what most people do is simply treat ordinal classification as generic multi-class classification: with $K$ classes, they use $K$ outputs and cross-entropy as the loss.
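For reference, the generic multi-class loss is the usual softmax cross-entropy. A minimal numpy sketch (the function names are mine, not from any particular library):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over K class scores."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-probability of the true class.

    logits: (N, K) array of raw scores; labels: (N,) integer class indices.
    """
    p = softmax(logits)
    n = len(labels)
    return -np.log(p[np.arange(n), labels]).mean()
```

With uniform logits the loss is $\log K$, and it approaches zero as the true class's logit dominates.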
But some people have come up with a clever encoding for ordinal classes (see this stackoverflow answer). It is a sort of cumulative one-hot encoding:
class 1 is represented as [0 0 0 0 ...]
class 2 is represented as [1 0 0 0 ...]
class 3 is represented as [1 1 0 0 ...]
i.e. neuron $k$ predicts the probability $P(\hat y > k)$. You still use a sigmoid as the activation function for each output, and I believe this encoding helps the network pick up on the ordering between classes. Afterwards, you apply a post-processing step (`np.sum` over the thresholded outputs) to convert the binary vector back into a class label.
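A minimal numpy sketch of this encoding and the `np.sum` decoding (the helper names are my own; classes are assumed to be labelled $1 \dots K$):

```python
import numpy as np

def ordinal_encode(labels, num_classes):
    """Encode integer labels 1..K as cumulative binary targets of length K-1.

    Class 1 -> [0 0 0 ...], class 2 -> [1 0 0 ...], class 3 -> [1 1 0 ...]:
    target bit k is 1 exactly when the class exceeds k.
    """
    labels = np.asarray(labels)
    thresholds = np.arange(1, num_classes)  # one threshold per output neuron
    return (labels[:, None] > thresholds[None, :]).astype(float)

def ordinal_decode(probs, threshold=0.5):
    """Turn sigmoid outputs back into a class label by counting bits above threshold."""
    return (np.asarray(probs) > threshold).sum(axis=1) + 1
```

Decoding is robust to mildly noisy sigmoid outputs, since each output only needs to fall on the correct side of 0.5.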
This strategy resembles the ensemble approach of Frank and Hall, which I believe is the first publication of this idea.
Best Answer
No! There is no limit whatsoever on the size of the output relative to the size of the input.
But in most cases, a higher number of outputs is not necessary at all.
In the case of language translation models, this is handled either fragment by fragment, or by giving the output a fixed maximum size (e.g. 2x the input size).