Solved – How to model multiple inputs to multiple outputs

Tags: multiple-regression, multivariate-analysis, regression

1. My purpose

I am trying to build a model with multiple inputs and multiple outputs, which is something like this:
[image: diagram of the multi-input, multi-output model]

I am not sure whether I should first aggregate the $x_i$ into $X$ and the $y_i$ into $Y$ to make the model simpler, like this:
[image: diagram of the aggregated $X \to Y$ model]

2. Features of the training data

There are several features for the training data:

  • Some factors are missing from different entries due to a lack of resources.
    For example,
    $X_1$ may lack $x_3$ and $x_5$, and $Y_1$ may lack $y_4$ and $y_5$;
    $X_2$ may lack $x_1$ and $x_2$, and $Y_2$ may lack $y_4$.
  • There might be multiple projections for $X = g(Y)$. For example:
    [image: example of multiple projections]

Of course, I would like to minimise the number of projections so as to build a more generalised model.

3. My question

How should I tackle this problem?

I feel that regression models (e.g. polynomial regression) and classification models (e.g. logistic regression, neural networks) only allow a single output for each entry.

I also do not think PLS is the right answer, as PLS essentially maps multiple $x$ variables to a single $y_i$ rather than treating $Y = \sum y_i$ as a whole.

It seems this is a multivariate (not multivariable) regression problem. But how should I deal with the missing factors?

Best Answer

tl;dr I recommend this, but only pending extensive visual data exploration.

Your problem as univariate classification

I was about to write an overview of supervised multivariate techniques, but then I realized I'd rather recast your problem.

Given $X$ = (charge, hydrophobicity, beta-sheet propensity, ...) and $Y$ = (pH, ionic strength, sugar, NaCl concentration, etc.), predict $Z$, where $Z = 1$ if the protein is stable and $0$ otherwise.

This tidily addresses the possibility of multiple $Y$ values for a given $X$, and now you have a "simple" univariate prediction problem. You mentioned logistic regression and neural networks, and those would be good baseline methods to try... if you have any examples with $Z=0$.
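As a minimal sketch of that baseline, assuming you do have some unstable ($Z=0$) examples. The features here are made-up synthetic stand-ins for the real protein/condition measurements, and `scikit-learn` is my choice of library, not something from the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: each row concatenates protein features (X) and
# condition features (Y); Z = 1 if that protein is stable under that condition.
n, p = 200, 8
XY = rng.normal(size=(n, p))
Z = (XY.sum(axis=1) + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = LogisticRegression().fit(XY, Z)
print(clf.score(XY, Z))  # training accuracy of the baseline
```

Each row is one (protein, condition) pair, which is exactly what makes multiple stabilizing conditions per protein unproblematic: they just become multiple rows.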

Your problem as unsupervised learning

So what if you only have examples where $Z=1$? You can't train a classifier. All you can do is assume that there is something generalizable about the points in your dataset -- some set of relationships that hold up across all the proteins and their stabilizing conditions. For example, maybe sugar concentration minus salt concentration always equals inverse hydrophobicity (I'm just bullshitting, of course). Common tools for uncovering structure in data include:

  • autoencoders: a neural net with your data as both input and output, and a bottleneck layer in the middle.
  • Principal components analysis or principal curves
  • The graphical lasso and its cousins, which convert the observed correlation structure into a network with a limited number of edges. I talk more about my favorite options below.
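To make the second bullet concrete, here is a small PCA sketch on synthetic data that (by construction, since I invented it) lies near a 2-dimensional subspace; with real data you would inspect the explained-variance profile rather than assume the dimension:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical: 100 complete rows of concatenated [X | Y] measurements
# that secretly live near a 2-dimensional subspace plus small noise.
latent = rng.normal(size=(100, 2))
data = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(100, 6))

pca = PCA(n_components=2).fit(data)
print(pca.explained_variance_ratio_.sum())  # near 1 when structure is low-dim
```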

Missing data overview

Regarding your missing data, it's helpful to know what caused it. In a clinical trial, patients might drop out due to side effects and thereby skew the results irreparably. You mentioned resource limitations, which suggests your missingness pattern might be independent of the missing values. (This is especially plausible if hydrophobic proteins are no more or less expensive to measure than the rest, and so on for all your measurements.) If that is a leap you're willing to make, then not only can you fill in the missing data; you can reasonably quantify the uncertainty in your model parameters.

One way to do this is multiple imputation:

  • formulate a probabilistic model for the missing data
  • simulate missing data from that model
  • complete your task as if no data were missing
  • repeat this many times and combine the resulting estimates via Rubin's Formulas (slide 7).

This page contains tons of information on missing data and multiple imputation. But, if you just want a point estimate, keep reading.
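The steps above can be sketched with scikit-learn's `IterativeImputer` (drawing from the posterior to get distinct imputations); the data, the target quantity (a column mean), and the missingness mechanism here are all invented for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]
full = rng.multivariate_normal([0, 0, 0], cov, size=300)
data = full.copy()
data[rng.random(data.shape) < 0.2] = np.nan  # missing completely at random

# Draw m imputed datasets and estimate the same quantity on each
# (here: the mean of column 0).
m = 10
estimates = []
for seed in range(m):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imp.fit_transform(data)
    estimates.append(completed[:, 0].mean())

# Rubin's rules: the pooled point estimate is the average over imputations;
# the total variance adds between-imputation variance (below) to the
# average within-imputation variance of the estimator.
pooled = np.mean(estimates)
between = np.var(estimates, ddof=1)
print(pooled, between)
```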

Specific models for missing data

In this situation, I might try a rank $r$ regression model

$$E[M] = RL$$,

where $M$ is $[X|Y]$ ($n$ rows, $p$ columns), $R$ is an unknown $n \times r$ matrix, $L$ is an unknown $r \times p$ matrix, and $E[\cdot]$ is the expectation operator. (Substitute in your preferred likelihood function or data transform.) You can fit this model despite the missing entries: using a simple squared loss as an example, minimize $$\sum_{(i,j) \in \Omega} \Big(M_{ij} - \sum_k R_{ik}L_{kj}\Big)^2,$$ where $\Omega$ is the set of observed entries. You can alternate between updating $R$ and updating $L$, so that each update is just a regression problem, and you'll probably do well.
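A minimal numpy sketch of that alternating scheme, on synthetic noiseless rank-$r$ data with entries missing at random (everything here is illustrative; real data would need noise handling and a convergence check):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 50, 8, 2
M_true = rng.normal(size=(n, r)) @ rng.normal(size=(r, p))
mask = rng.random((n, p)) < 0.8          # Omega: observed entries
M = np.where(mask, M_true, np.nan)

R = rng.normal(size=(n, r))
L = rng.normal(size=(r, p))
for _ in range(50):
    # Update each row of R by least squares against its observed columns.
    for i in range(n):
        obs = mask[i]
        R[i] = np.linalg.lstsq(L[:, obs].T, M[i, obs], rcond=None)[0]
    # Update each column of L by least squares against its observed rows.
    for j in range(p):
        obs = mask[:, j]
        L[:, j] = np.linalg.lstsq(R[obs], M[obs, j], rcond=None)[0]

# Mean squared error over the observed entries only (nan elsewhere).
err = np.nanmean((M - R @ L) ** 2)
print(err)
```

Each inner update is an ordinary least-squares regression, which is why the alternation is easy to implement; it also fills in the unobserved entries for free via $RL$.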

Rather than low rank, another option is to use a flexible distribution based on a sparse inverse covariance matrix. (Why inverse covariance? It's the Markov random field representation.) This solution seems to accommodate all of your needs at once, so I linked to it in the tl;dr.
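A sketch of sparse inverse covariance estimation with scikit-learn's `GraphicalLassoCV`, on invented complete data where one pair of variables depends on each other and a third is independent (on real data with missing entries you would pair this with an imputation step):

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(4)
# Hypothetical complete data with a known sparse dependence structure:
# b depends on a; c is independent of both.
n = 500
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = rng.normal(size=n)
data = np.column_stack([a, b, c])

model = GraphicalLassoCV().fit(data)
precision = model.precision_  # estimated sparse inverse covariance
print(np.round(precision, 2))
```

In the Markov random field reading, a (near-)zero off-diagonal entry of the precision matrix means the two variables are conditionally independent given the rest, so the entries linking `c` to `a` and `b` should be shrunk toward zero while the `a`-`b` entry stays large and negative.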

Do due diligence

Finally, please do not fit complex models to your data without first visualizing it and exploring it. I recommend scrutinizing all of the pairwise relationships by plotting them somehow (scatterplots for continuous, contingency tables for categorical, and side-by-side boxplots for mixed). This might reveal outliers or physically implausible trends that bear investigation. It might reveal that my answer is off track or requires modification: perhaps you'll end up with a couple of protein clusters that behave very differently, so you decide to model them each separately. Perhaps the data are just really ugly, and there's nothing to see other than "super acidic and salty things are unstable." I'm interested to know what you find.
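For the pairwise-scatterplot part of that exploration, pandas has a one-call scatterplot matrix; the column names below are placeholders for whatever you actually measured:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(5)
# Hypothetical frame of a few protein/condition measurements.
df = pd.DataFrame({
    "charge": rng.normal(size=100),
    "hydrophobicity": rng.normal(size=100),
    "pH": rng.uniform(2, 10, size=100),
})

# One panel per variable pair, histograms on the diagonal.
axes = scatter_matrix(df, diagonal="hist")
plt.savefig("pairwise.png")
```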
