Perhaps I am misunderstanding the question, but for classification it seems to me the standard way is to have an output neuron for each of the $N$ classes. Then the $N$-vector of $[0, 1]$ output values represents the probability of the input belonging to each class, and so can be interpreted as the "confidence" you want to obtain.
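As a minimal sketch of that setup in PyTorch (the class count, input dimension, and hidden size are all placeholder assumptions), a softmax over one output unit per class yields such a probability vector:

```python
import torch
import torch.nn as nn

N_CLASSES = 5    # hypothetical number of classes
N_FEATURES = 10  # hypothetical input dimension

# A small feedforward classifier: one output unit per class.
model = nn.Sequential(
    nn.Linear(N_FEATURES, 32),
    nn.ReLU(),
    nn.Linear(32, N_CLASSES),  # raw scores (logits), one per class
)

x = torch.randn(1, N_FEATURES)          # a single example
probs = torch.softmax(model(x), dim=1)  # values in [0, 1] summing to 1
print(probs)  # per-class "confidence" vector
```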
Before giving up on linear models, you could also try regularized linear models. For example, you can penalize the $\ell_2$ norm of the weights (ridge regression), which expresses a preference for smaller weights. You can also penalize the $\ell_1$ norm (lasso), which induces sparse solutions (you can think of this as a kind of automatic feature selection). There's also the elastic net, which is a combination of $\ell_1$ and $\ell_2$ penalties. These techniques are very popular, and can improve generalization performance when appropriately matched to the problem. They can also make it possible to solve problems where the number of input variables exceeds the number of data points. You'll often see these techniques discussed in the context of regression problems with a single scalar output. But, you can also apply them in the case of vector-valued outputs. If searching for sparse solutions, you'd have to decide whether or not the weights for all outputs should share the same sparsity structure (i.e. should all columns of $M$ have zeros in the same rows as each other?).
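As an illustration (a sketch using scikit-learn; the data here is random and the `alpha` values are arbitrary), the three penalties differ only in the regularizer, all of them accept vector-valued targets, and `MultiTaskLasso` is the variant that forces all columns of $M$ to share the same sparsity structure:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet, MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))  # 100 samples, 10 input variables
Y = rng.standard_normal((100, 20))  # 20 outputs (vector-valued target)

Ridge(alpha=1.0).fit(X, Y)                      # l2 penalty: shrinks weights
Lasso(alpha=0.1).fit(X, Y)                      # l1 penalty: per-output sparsity
ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)   # combined l1 + l2
MultiTaskLasso(alpha=0.1).fit(X, Y)             # zero rows shared across outputs
```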
As for neural nets, you're correct that your model is equivalent to a shallow network with linear activation functions. Many people use neural nets for regression, and producing vector-valued outputs is no problem. You'll want a feedforward network: the input layer should contain 10 units, and the output layer should contain 20 units. Because you're solving a regression problem, the output layer should use linear activation functions, which can represent any real number (unlike most nonlinear activation functions, whose outputs are squashed or clipped to a fixed range). For regression, you should also typically use the squared error as the loss function.

If you want the network to implement a nonlinear function, you'll need at least one hidden layer with nonlinear activation functions (e.g. sigmoid, ReLU, etc.). Typically all units in the network would have a bias term. You'll probably want to pre-process your inputs by at least centering and standardizing them, and possibly performing PCA.

Other than these recommendations, the world of neural nets is wide open, and the proper choices depend heavily on your problem (the number, size, and activation functions of hidden layers; initialization procedures; optimization/learning rules; regularization; etc.). The upside is that neural nets are very powerful; the downside is that you may have to spend a considerable amount of time exploring different choices.
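For concreteness, here is a minimal sketch in PyTorch of the kind of network described: 10 inputs, one nonlinear hidden layer, 20 linear outputs, squared-error loss. The hidden size, optimizer settings, and training data are placeholder assumptions, not recommendations:

```python
import torch
import torch.nn as nn

# 10 inputs -> nonlinear hidden layer -> 20 linear output units.
# nn.Linear includes a bias term by default.
net = nn.Sequential(
    nn.Linear(10, 64),  # hidden size 64 is an arbitrary choice
    nn.ReLU(),          # nonlinear activation in the hidden layer
    nn.Linear(64, 20),  # linear outputs: any real value
)

loss_fn = nn.MSELoss()  # squared error, standard for regression
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

X = torch.randn(256, 10)  # placeholder training data
Y = torch.randn(256, 20)

# Pre-processing: center and standardize the inputs.
X = (X - X.mean(dim=0)) / X.std(dim=0)

for _ in range(100):  # a few training steps
    opt.zero_grad()
    loss = loss_fn(net(X), Y)
    loss.backward()
    opt.step()
```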
You could also consider other nonlinear regression methods. For example, many standard techniques, like k-nearest neighbors, random forests, and boosted decision trees, can produce vector-valued outputs. These methods require fewer choices on your part than neural nets.
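For instance (a sketch with scikit-learn on random placeholder data), a random forest or k-nearest-neighbors regressor handles the 10-input / 20-output case directly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Y = rng.standard_normal((100, 20))  # vector-valued targets

forest = RandomForestRegressor(n_estimators=100).fit(X, Y)
knn = KNeighborsRegressor(n_neighbors=5).fit(X, Y)

print(forest.predict(X[:1]).shape)  # (1, 20)
print(knn.predict(X[:1]).shape)     # (1, 20)
```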
Best Answer
You would have to output vectors of means and standard deviations, rather than single point predictions, to achieve that.
One way to obtain those vectors is variational inference: have the network generate the means and standard deviations, sample with the reparametrization trick, then optimize so that the sampled results match the target values as in ordinary regression (i.e. with an MSE/MAPE/MAE or similar loss), while regularizing the means and standard deviations toward 0 and 1 respectively (a KL-divergence penalty toward a standard normal).
This is essentially the same process as a vanilla Variational Autoencoder, except that you're not bound to the autoencoder architecture, and you want the means/stddevs as the outputs of the trained network rather than the sampled values.
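A minimal sketch of that idea in PyTorch (the network shape, the weighting of the KL term, and the data are all assumptions): the network outputs means and log-variances, we sample with the reparametrization trick, and the loss combines a regression term on the samples with a KL penalty toward a standard normal:

```python
import torch
import torch.nn as nn

# 10 inputs -> 40 outputs: 20 means and 20 log-variances.
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 40))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

X = torch.randn(256, 10)  # placeholder inputs
Y = torch.randn(256, 20)  # placeholder regression targets

for _ in range(100):
    mu, log_var = net(X).chunk(2, dim=1)  # means and log-variances

    # Reparametrization trick: sample = mu + sigma * eps, eps ~ N(0, 1),
    # so gradients flow through mu and log_var.
    eps = torch.randn_like(mu)
    sample = mu + torch.exp(0.5 * log_var) * eps

    # Regression loss on the sampled values, as in ordinary MSE regression.
    rec = ((sample - Y) ** 2).mean()

    # KL divergence to N(0, I): pulls means toward 0, stddevs toward 1.
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).mean()

    loss = rec + 0.1 * kl  # the KL weight is an arbitrary choice
    opt.zero_grad()
    loss.backward()
    opt.step()

# At test time, read off mu and exp(0.5 * log_var) as the predicted
# means and standard deviations rather than using the sampled values.
```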