For assessing the complexity of a model, the number of free parameters is a good start: from it you can calculate AIC or BIC. How to count the free parameters in a multilayer perceptron (MLP) neural network is covered here: Number of parameters in an artificial neural network for AIC
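As a rough sketch of that counting, assuming a fully connected network with one bias per non-input unit: the function names and the log-likelihood value below are made up for illustration, and the AIC/BIC formulas are the standard $2k - 2\ln L$ and $k \ln n - 2\ln L$.

```python
import math

def mlp_param_count(layer_sizes, bias=True):
    """Count weights (and optionally biases) in a fully connected MLP.

    layer_sizes lists the units per layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out          # weights between the two layers
        if bias:
            total += n_out             # one bias per unit in the next layer
    return total

def aic(log_likelihood, k):
    """Akaike information criterion for k free parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n data points."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical example: 10 inputs, 5 hidden units, 1 output.
k = mlp_param_count([10, 5, 1])        # 10*5 + 5 + 5*1 + 1
hypothetical_loglik = -100.0           # placeholder, not a real fit
print(k, aic(hypothetical_loglik, k), bic(hypothetical_loglik, k, n=200))
```

Note that this treats every weight as a fully free parameter, which is exactly the assumption that breaks down once regularization is involved.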

In addition, there are cases where you have a lot of parameters, but they are not "totally free" because of regularization. For example, in linear regression, if you have $1000$ features but only $500$ data points, it is perfectly fine to fit a model with $1000$ coefficients, as long as you regularize the coefficients with a large regularization parameter. Search for ridge regression or lasso regression for details.
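Here is a small NumPy sketch of that situation, on synthetic data I made up for the purpose ($500$ points, $1000$ features). It uses the dual form of the ridge solution, $w = X^\top (X X^\top + \lambda I)^{-1} y$, so only a $500 \times 500$ system has to be solved; `ridge_fit` is just an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 1000                       # fewer data points than features

# Synthetic data: only the first 10 features matter (an assumption for the demo).
X = rng.standard_normal((n, p))
true_w = np.zeros(p)
true_w[:10] = 1.0
y = X @ true_w + 0.1 * rng.standard_normal(n)

def ridge_fit(X, y, lam):
    """Ridge coefficients via the dual form (solves an n x n system)."""
    n_samples = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n_samples), y)
    return X.T @ alpha

w_small = ridge_fit(X, y, lam=1.0)
w_large = ridge_fit(X, y, lam=1000.0)

# A larger penalty shrinks the coefficient vector toward zero.
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

All $1000$ coefficients are estimated in both fits, but the larger penalty leaves them much less room to move.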

In the neural network case, people may likewise use a very complicated network structure (many layers, many neurons) together with some regularization. In that case, simply counting parameters as described above will not work.

Finally, I would not agree with your statement about random forests. As discussed in Breiman's original paper, increasing the number of trees does not lead to a more complex model or to overfitting. Instead, the out-of-bag (OOB) error converges as the number of trees grows. In practice, if computational power is not a concern, building a random forest with a large number of trees is actually recommended.

To your comment:

Model complexity is an abstract concept and can be defined in different ways. AIC and BIC use one definition (the number of free parameters), and other ways of defining it exist. See Definition of model complexity in XGBoost for an example.

In addition, it is fine for two NNs to have different structures but still have the same complexity. Here is an example: say we are doing polynomial regression. You have two options: one is a higher-order model with more regularization, the other is a lower-order model without regularization. The two can have the same "complexity" even though their structures are different.
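One way to make the polynomial example concrete is the effective degrees of freedom of ridge regression, $\mathrm{df}(\lambda) = \sum_i d_i^2 / (d_i^2 + \lambda)$, where $d_i$ are the singular values of the design matrix. This is only one possible complexity measure, and the sketch below makes simplifying assumptions: a fixed grid of inputs, monomial features, and a penalty that also covers the intercept.

```python
import numpy as np

def effective_df(X, lam):
    """Effective degrees of freedom of ridge regression on design matrix X."""
    d = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + lam)))

x = np.linspace(-1, 1, 50)
X_low = np.vander(x, 4, increasing=True)     # degree 3: columns 1, x, x^2, x^3
X_high = np.vander(x, 10, increasing=True)   # degree 9: ten columns

df_low = effective_df(X_low, 0.0)            # unregularized: one df per column

# Bisection (on a log scale) for the penalty that gives the degree-9 model
# the same effective degrees of freedom as the unregularized cubic.
lo, hi = 1e-12, 1e6
for _ in range(200):
    mid = np.sqrt(lo * hi)
    if effective_df(X_high, mid) > df_low:
        lo = mid                             # still too flexible: penalize more
    else:
        hi = mid
lam_match = np.sqrt(lo * hi)

print(df_low, lam_match, effective_df(X_high, lam_match))
```

The two models have very different structure (4 versus 10 coefficients) but the same effective degrees of freedom under this measure.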

Before giving up on linear models, you could also try regularized linear models. For example, you can penalize the $\ell_2$ norm of the weights (ridge regression), which expresses a preference for smaller weights. You can also penalize the $\ell_1$ norm (lasso), which induces sparse solutions (you can think of this as a kind of automatic feature selection). There's also the elastic net, which is a combination of $\ell_1$ and $\ell_2$ penalties. These techniques are very popular, and can improve generalization performance when appropriately matched to the problem. They can also make it possible to solve problems where the number of input variables exceeds the number of data points. You'll often see these techniques discussed in the context of regression problems with a single scalar output. But, you can also apply them in the case of vector-valued outputs. If searching for sparse solutions, you'd have to decide whether or not the weights for all outputs should share the same sparsity structure (i.e. should all columns of $M$ have zeros in the same rows as each other?).
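To illustrate the difference between per-entry sparsity and shared sparsity, here is a sketch of the two proximal (shrinkage) operators that typically appear inside proximal-gradient solvers: elementwise soft-thresholding for the plain $\ell_1$ penalty, and row-wise soft-thresholding for a group penalty on the rows of $M$. The small matrix is made up, with rows standing for input variables and columns for outputs.

```python
import numpy as np

def soft_threshold(M, t):
    """Prox of t * sum_ij |M_ij|: shrinks and zeros individual entries."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def row_soft_threshold(M, t):
    """Prox of t * sum_i ||M[i, :]||_2: shrinks and zeros whole rows."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return M * scale

# Hypothetical weight matrix: 3 input variables (rows), 3 outputs (columns).
M = np.array([[ 3.0, -0.2,  0.10],
              [ 0.3,  0.2, -0.10],
              [-2.0,  1.5,  0.05]])

A = soft_threshold(M, 0.5)       # each output gets its own sparsity pattern
B = row_soft_threshold(M, 0.5)   # an input is dropped for all outputs or none

print(A != 0)
print(B != 0)
```

With the row-wise penalty, a row is either kept for every output or zeroed for every output, which corresponds to selecting the same input variables for all outputs.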

As for neural nets, you're correct that your model is equivalent to a shallow network with linear activation functions. Many people use neural nets for regression, and producing vector-valued outputs is no problem. You'll want a feedforward network. The input layer should contain 10 units, and the output layer should contain 20 units. Because you're solving a regression problem, the output layer should use linear activation functions, which can represent any real number (unlike most nonlinear activation functions, which are either squashed or clipped to a specific range). For regression, you should also typically use the squared error as the loss function. If you want the network to implement a nonlinear function, you'll need at least one hidden layer with nonlinear activation functions (e.g. sigmoid, ReLU, etc.). Typically all units in the network would have a bias term. You'll probably want to pre-process your inputs by at least centering and standardizing them, and possibly performing PCA. Other than these recommendations, the world of neural nets is wide open, and the proper choices depend heavily on your problem (the number, size, and activation functions of hidden layers; initialization procedures; optimization/learning rules; regularization; etc.). The upside of this is that neural nets are very powerful. The downside is that you may have to spend a considerable amount of time exploring different choices.
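A minimal NumPy sketch of such a network, assuming one hidden layer of 32 tanh units, plain full-batch gradient descent, and synthetic data; the hidden size, learning rate, step count, and the `train` helper are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

def train(X, Y, d_hidden=32, lr=0.1, steps=1000, seed=0):
    """Fit a one-hidden-layer network with linear outputs and squared error."""
    rng = np.random.default_rng(seed)
    d_in, d_out = X.shape[1], Y.shape[1]
    W1 = rng.standard_normal((d_in, d_hidden)) / np.sqrt(d_in)
    b1 = np.zeros(d_hidden)
    W2 = rng.standard_normal((d_hidden, d_out)) / np.sqrt(d_hidden)
    b2 = np.zeros(d_out)
    losses = []
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)            # nonlinear hidden layer
        Y_hat = H @ W2 + b2                 # linear output layer
        err = Y_hat - Y
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error.
        g_out = 2.0 * err / err.size
        g_hidden = (g_out @ W2.T) * (1.0 - H ** 2)
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hidden)
        b1 -= lr * g_hidden.sum(axis=0)
    return (W1, b1, W2, b2), losses

# Synthetic regression problem: 10 inputs, 20 outputs (made up for the demo).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
Y = np.sin(X @ (0.3 * rng.standard_normal((10, 20))))

X = (X - X.mean(axis=0)) / X.std(axis=0)    # center and standardize inputs
params, losses = train(X, Y)
print(losses[0], losses[-1])
```

In practice you would use a framework with automatic differentiation and monitor error on held-out data rather than the training loss shown here.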

You could also consider other nonlinear regression methods. For example, many standard techniques like k nearest neighbors, random forests, boosted decision trees, etc. could produce vector-valued outputs. These methods require fewer choices on your part than neural nets.
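For instance, k nearest neighbors handles vector-valued outputs with no change to the algorithm: you average the output vectors of the neighbors. A brute-force sketch on made-up data (the function name and `k=5` default are my own choices):

```python
import numpy as np

def knn_predict(X_train, Y_train, X_query, k=5):
    """Predict each query's output vector as the mean over its k nearest neighbors."""
    # Squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :k]     # indices of the k closest points
    return Y_train[idx].mean(axis=1)        # (n_query, k, d_out) -> (n_query, d_out)

rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 10))    # 10 inputs
Y_train = rng.standard_normal((100, 20))    # 20 outputs
X_query = rng.standard_normal((7, 10))

Y_pred = knn_predict(X_train, Y_train, X_query, k=5)
print(Y_pred.shape)
```

The only real choices here are `k` and the distance metric, which is part of why these methods need less tuning.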

## Best Answer

Neural networks can in principle model nonlinearities automatically (see the universal approximation theorem), which you would need to explicitly model using transformations (splines etc.) in linear regression.

The caveat: the temptation to overfit can be (even) stronger in neural networks than in regression, since adding hidden layers or neurons looks harmless. So be extra careful to look at out-of-sample prediction performance.
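A quick way to see why the out-of-sample check matters is a simple holdout split. The sketch below uses polynomial degree as a stand-in for model capacity (it is not a neural network), on synthetic data I made up; the same train/test comparison applies when you add hidden layers or neurons.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(60)

# Hold out the last 20 points; fit only on the first 40.
x_tr, y_tr = x[:40], y[:40]
x_te, y_te = x[40:], y[40:]

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

results = {}
for degree in (1, 3, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)
    results[degree] = (mse(coefs, x_tr, y_tr), mse(coefs, x_te, y_te))

for degree, (train_err, test_err) in results.items():
    print(degree, train_err, test_err)
```

Training error can only go down as capacity increases, so it says nothing about overfitting; the held-out error is the number to watch.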