Solved – Structuring longitudinal clinical input for LSTM RNN

lstm, recurrent neural network

I'm totally new to LSTMs and would like some guidance on how to structure input data for classification using multivariate longitudinal data. Most, if not all, tutorials online are not healthcare-related, and I could not find a good analogy to work from.

The problem: predicting whether a hospitalization event will occur in the next 30-90 days (days 0-29 are a lag period for intervention).

The data: Each observation is one patient, and each patient has several lab outcomes with many sequential values. Each patient also has non-sequential data, like gender and race.

Questions:
1. How do I distinguish the values of one lab outcome's sequence from another's?
2. How do I input non-sequential variables along with sequential?
3. How should LSTM parameters/architecture be adjusted to accommodate?

My thinking so far is that the data should be structured using an array, where each row is the sequence of a lab outcome, and non-sequential attributes exist as sequences with the same value in each column. I assume that the LSTM inherently knows that each row is another variable.

Here are two papers that are helpful for context:
1. Learning to diagnose with LSTM RNNs https://arxiv.org/abs/1511.03677v7
2. Multi task prediction of disease onset from longitudinal lab tests https://arxiv.org/abs/1608.00647v3

I am using Python: Keras with TensorFlow.

Best Answer

Your concept of an input array where each row is the sequence of a different lab is correct, as long as all the labs are observed at the same time intervals. For example, one observation per day for every lab sequence would work fine. If some labs were observed at other intervals, such as hourly or weekly, you might not want to mix them in one array. In that case you could run a separate LSTM on each time scale you need and connect all of their hidden states to your outputs, but that is a considerably more complex design.

As for how to distinguish one lab's values from another's, this happens automatically as long as you keep the order of your labs consistent across all time steps. For example, if lab A is your first input value at timestep T and lab B is your second, then as long as you keep the same order at timestep T+1 and so on, the network won't confuse them.
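As a minimal sketch of that layout, assuming daily observations and three hypothetical labs (the lab names, the values, and the `build_patient_array` helper are illustrative, not from the question): Keras LSTM layers expect input shaped (samples, timesteps, features), so here each row is a time step and each lab is a column, which is the transpose of the row-per-lab layout described in the question. The column order is fixed once and reused for every patient.

```python
import numpy as np

# Hypothetical lab names; what matters is that this order never changes.
LAB_ORDER = ["creatinine", "hemoglobin", "sodium"]

def build_patient_array(lab_series, lab_order=LAB_ORDER):
    """Stack per-lab sequences into a (timesteps, features) array.

    lab_series maps lab name -> list of values, one per day,
    with every lab having the same number of days.
    """
    columns = [np.asarray(lab_series[name], dtype="float32") for name in lab_order]
    return np.stack(columns, axis=-1)

# Made-up values; the dict key order does not matter because
# LAB_ORDER decides which column each lab lands in.
patient_a = {
    "sodium": [140, 139, 141, 138],
    "creatinine": [1.1, 1.2, 1.4, 1.3],
    "hemoglobin": [13.5, 13.2, 12.9, 13.0],
}
patient_b = {
    "hemoglobin": [11.0, 11.2, 11.1, 10.8],
    "sodium": [136, 137, 135, 134],
    "creatinine": [0.9, 0.9, 1.0, 1.0],
}

# Final shape: (patients, timesteps, labs)
X = np.stack([build_patient_array(p) for p in (patient_a, patient_b)])
print(X.shape)  # (2, 4, 3)
```

With this layout, `X[:, :, 0]` is always creatinine and `X[:, :, 2]` is always sodium, for every patient and every time step.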

There are two options I'm aware of for adding non-sequential data. You mentioned the first one: have an input that is the same at every time step. The other option is to process your sequences without the non-sequential inputs to get a hidden state, then append each non-sequential input to the end of the hidden state vector and calculate your output from that. For example, if your hidden state size is 50 and you had 3 different inputs you wanted to add on at the end, you would get a vector of length 53 that connects into your outputs.
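The second option can be sketched with the Keras functional API. This is only an illustration under assumed sizes: 30 daily time steps and 5 labs are made-up numbers, while the hidden state size of 50 and the 3 non-sequential inputs follow the example above. The single sigmoid output is also an assumption, chosen to match a yes/no hospitalization label.

```python
from tensorflow import keras

N_TIMESTEPS, N_LABS, N_STATIC = 30, 5, 3  # assumed sizes, for illustration only

# Two separate inputs: lab sequences and non-sequential attributes.
seq_in = keras.Input(shape=(N_TIMESTEPS, N_LABS), name="lab_sequences")
static_in = keras.Input(shape=(N_STATIC,), name="static_features")

# The LSTM returns only its final hidden state: a vector of length 50.
hidden = keras.layers.LSTM(50)(seq_in)

# Append the non-sequential inputs to the hidden state: 50 + 3 = 53 values.
merged = keras.layers.Concatenate()([hidden, static_in])

# One sigmoid unit for a binary outcome (assumed here).
out = keras.layers.Dense(1, activation="sigmoid")(merged)

model = keras.Model(inputs=[seq_in, static_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Training such a model would pass both arrays together, for example `model.fit([X_seq, X_static], y)`, where `X_seq` has shape (patients, 30, 5) and `X_static` has shape (patients, 3).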

You may have to experiment to see which works best. If one of your non-sequential inputs should significantly change the way your sequence is processed, it may be best to add it in at every time step. If not, adding it in over and over can slow things down and sometimes even decrease performance by drowning out the other, more important inputs.
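If you go with the first option, the non-sequential values can be repeated along the time axis and appended as extra feature columns. A minimal NumPy sketch, with made-up shapes and random placeholder data standing in for real labs and encoded attributes such as gender and race:

```python
import numpy as np

n_patients, n_timesteps, n_labs, n_static = 4, 30, 5, 3  # assumed sizes

rng = np.random.default_rng(0)
X_seq = rng.normal(size=(n_patients, n_timesteps, n_labs))  # placeholder lab sequences
X_static = rng.normal(size=(n_patients, n_static))          # placeholder static attributes

# Copy each patient's static vector to every time step: (4, 3) -> (4, 30, 3).
X_static_rep = np.repeat(X_static[:, np.newaxis, :], n_timesteps, axis=1)

# Append the repeated static columns after the lab columns: (4, 30, 5 + 3).
X_combined = np.concatenate([X_seq, X_static_rep], axis=-1)
print(X_combined.shape)  # (4, 30, 8)
```

The last three columns of `X_combined` are identical at every time step for a given patient, which is exactly the redundancy the paragraph above warns can slow things down.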