I need to build a model (M) that converts a 10 dimensional space of inputs (A) into a 20 dimensional space of outputs B.

Both the inputs and outputs are analog, so this is not a classification problem but rather a regression one.

A * M = B

To allow for a bias term, I appended a column of 1s to A:

$A = \begin{bmatrix} A & 1 \end{bmatrix}$

My first approach was simply to left-multiply by the pseudoinverse of A to obtain M:

$M = A^{+}B$

However, this does not give good results.

Analyzing what I did, I think it's equivalent to using a one layer neural network with bias but without activation function and without regularization.
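For concreteness, here is a minimal NumPy sketch of the pseudoinverse fit described above. The data are synthetic stand-ins, and the sample count, noise level, and variable names (`A_aug`, `M_true`) are illustrative assumptions rather than my real setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 200 samples, 10 inputs, 20 outputs.
n_samples, n_in, n_out = 200, 10, 20
A = rng.normal(size=(n_samples, n_in))
M_true = rng.normal(size=(n_in + 1, n_out))

# Append a column of ones so that the last row of M acts as the bias.
A_aug = np.hstack([A, np.ones((n_samples, 1))])
B = A_aug @ M_true + 0.01 * rng.normal(size=(n_samples, n_out))

# Least-squares solution via the Moore-Penrose pseudoinverse.
M = np.linalg.pinv(A_aug) @ B

# np.linalg.lstsq solves the same least-squares problem directly.
M_lstsq, *_ = np.linalg.lstsq(A_aug, B, rcond=None)

print(M.shape)                   # one row per input plus one bias row
print(np.allclose(M, M_lstsq))   # both routes give the same fit
```

Each row of `A_aug` is one sample, which is why the ones go in a column and the bias ends up as the last row of `M`.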

Now, to improve my model I want to make it more complex using neural networks, but some questions arise.

I find a lot of examples on the internet that use neural networks for classification, where each output neuron represents a class and its output is associated with the probability of the input belonging to that class.

However my problem has nothing to do with classification (think about it as an operator that takes some location in a 10 dimensional space and "moves" it to a new location in a higher dimensional space).

Is there a neural network topology (type of activation function, regularization, training method) that is particularly suited to this problem? How would you approach such a problem?

Thanks a lot.

## Best Answer

Before giving up on linear models, you could also try regularized linear models. For example, you can penalize the $\ell_2$ norm of the weights (ridge regression), which expresses a preference for smaller weights. You can also penalize the $\ell_1$ norm (lasso), which induces sparse solutions (you can think of this as a kind of automatic feature selection). There's also the elastic net, which is a combination of $\ell_1$ and $\ell_2$ penalties. These techniques are very popular, and can improve generalization performance when appropriately matched to the problem. They can also make it possible to solve problems where the number of input variables exceeds the number of data points. You'll often see these techniques discussed in the context of regression problems with a single scalar output. But, you can also apply them in the case of vector-valued outputs. If searching for sparse solutions, you'd have to decide whether or not the weights for all outputs should share the same sparsity structure (i.e. should all columns of $M$ have zeros in the same rows as each other?).
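As a rough illustration of the ridge case with vector-valued outputs, here is a minimal NumPy sketch using the closed-form solution. The helper name `fit_ridge`, the synthetic data, and the penalty value `lam=10.0` are assumptions made for the example; in practice the penalty strength would be chosen by something like cross-validation. The data are centered first so that the bias term is not penalized.

```python
import numpy as np

def fit_ridge(A, B, lam):
    """Multi-output ridge regression; the intercept is left unpenalized."""
    a_mean = A.mean(axis=0)
    b_mean = B.mean(axis=0)
    Ac = A - a_mean          # centered inputs
    Bc = B - b_mean          # centered outputs
    n_in = A.shape[1]
    # Closed form: W = (Ac^T Ac + lam * I)^{-1} Ac^T Bc, solved for all outputs at once.
    W = np.linalg.solve(Ac.T @ Ac + lam * np.eye(n_in), Ac.T @ Bc)
    bias = b_mean - a_mean @ W
    return W, bias

# Synthetic stand-in data (an assumption): few samples, noisy outputs.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
W_true = rng.normal(size=(10, 20))
B = A @ W_true + 1.0 + 0.5 * rng.normal(size=(30, 20))

W_ols, b_ols = fit_ridge(A, B, lam=0.0)       # ordinary least squares
W_ridge, b_ridge = fit_ridge(A, B, lam=10.0)  # ridge shrinks the weights

print(np.linalg.norm(W_ols), np.linalg.norm(W_ridge))
```

Predictions for new inputs are `A_new @ W + bias`. The lasso and elastic net have no closed form like this and need an iterative solver.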

As for neural nets, you're correct that your model is equivalent to a shallow network with linear activation functions. Many people use neural nets for regression, and producing vector-valued outputs is no problem. You'll want a feedforward network. The input layer should contain 10 units, and the output layer should contain 20 units. Because you're solving a regression problem, the output layer should use linear activation functions, which can represent any real number (unlike most nonlinear activation functions, which are either squashed or clipped to a specific range). For regression, you should also typically use the squared error as the loss function. If you want the network to implement a nonlinear function, you'll need at least one hidden layer with nonlinear activation functions (e.g. sigmoid, ReLU, etc.). Typically all units in the network would have a bias term. You'll probably want to pre-process your inputs by at least centering and standardizing them, and possibly performing PCA. Other than these recommendations, the world of neural nets is wide open, and the proper choices depend heavily on your problem (the number, size, and activation functions of hidden layers; initialization procedures; optimization/learning rules; regularization; etc.). The upside of this is that neural nets are very powerful. The downside is that you may have to spend a considerable amount of time exploring different choices.
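To make the architecture concrete, here is a small sketch in plain NumPy of a feedforward network with 10 inputs, one tanh hidden layer, 20 linear outputs, and a mean squared error loss, trained by full-batch gradient descent. The hidden layer size, learning rate, step count, initialization, and synthetic target are all arbitrary assumptions for illustration; a real project would more likely use a deep learning library and tune these choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic nonlinear regression problem (an assumption, for illustration only).
n, n_in, n_hidden, n_out = 500, 10, 32, 20
X = rng.normal(size=(n, n_in))
Y = np.sin(X @ (0.3 * rng.normal(size=(n_in, n_out))))

# Pre-processing: center and standardize each input dimension.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Parameters: every layer has a bias; weights get a simple scaled random init.
W1 = rng.normal(size=(n_in, n_hidden)) / np.sqrt(n_in)
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_out)) / np.sqrt(n_hidden)
b2 = np.zeros(n_out)

def forward(X):
    H = np.tanh(X @ W1 + b1)   # nonlinear hidden layer
    return H, H @ W2 + b2      # linear output layer: any real value

lr = 0.2
losses = []
for step in range(500):
    H, Y_hat = forward(X)
    err = Y_hat - Y
    losses.append(np.mean(err ** 2))   # mean squared error

    # Backpropagation of the mean squared error.
    dY = 2.0 * err / err.size
    dW2 = H.T @ dY
    db2 = dY.sum(axis=0)
    dZ = (dY @ W2.T) * (1.0 - H ** 2)  # derivative of tanh is 1 - tanh^2
    dW1 = X.T @ dZ
    db1 = dZ.sum(axis=0)

    # Plain gradient descent update.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(losses[0], losses[-1])
```

Swapping `np.tanh` for the identity would reduce this network to the linear model from the question.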

You could also consider other nonlinear regression methods. For example, many standard techniques like k nearest neighbors, random forests, boosted decision trees, etc. could produce vector-valued outputs. These methods require fewer choices on your part than neural nets.
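As one example, k nearest neighbors extends to vector-valued outputs simply by averaging the output vectors of the neighbors. The sketch below is a brute-force NumPy version; the function name `knn_predict`, the Euclidean metric, and `k=5` are assumptions chosen for illustration.

```python
import numpy as np

def knn_predict(X_train, Y_train, X_query, k=5):
    """Predict each query's output as the mean output of its k nearest training points."""
    # Squared Euclidean distance between every query and every training point.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices, shape (n_query, k)
    return Y_train[nearest].mean(axis=1)      # shape (n_query, n_out)

# Synthetic stand-in data (an assumption): 10-dimensional inputs, 20-dimensional outputs.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 10))
Y_train = np.sin(X_train @ rng.normal(size=(10, 20)))
X_query = rng.normal(size=(5, 10))

Y_pred = knn_predict(X_train, Y_train, X_query, k=5)
print(Y_pred.shape)
```

The only choices here are `k` and the distance metric, which is far fewer than a neural net requires.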