I have data of shape `(batch, seq_len, features)` representing a sliding window over a time series. In essence, I'm using the most recent `seq_len` steps to predict a single target variable, which means the output at the last of the `seq_len` positions of my `MultiHeadAttention` layer should be the predicted value.
I've made many attempts at generating different `attention_mask`s to use in Keras' `MultiHeadAttention`, but none of them quite capture the behavior I want, inevitably leading to poor results. Ultimately I only want the importance of each of the `seq_len` query steps relative to the last key step. It's basically an autoregressive additive model using the transformer architecture (encoder only). The last step is to `tf.reduce_sum` over the entire `seq_len` dimension to get the output.
Future modifications to the attention layer might include teacher forcing, which should further improve the learning phase and reduce the obvious influence of the last value's correlation with itself, but I can't figure out how to construct a correct mask in the first place for continuous time-series data like this. To be clear, this is NOT an NLP model.
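For reference, here is a minimal sketch of the setup described above (layer sizes and dimensions are placeholders, and `attention_mask` is left unset because choosing it is exactly the open question):

```python
import tensorflow as tf

batch, seq_len, features = 32, 10, 4

inputs = tf.keras.Input(shape=(seq_len, features))
# Self-attention over the sliding window; the attention_mask to use
# here is the open question in this post.
attn = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=features)(
    inputs, inputs
)
# Additive readout: project each step to a scalar contribution,
# then sum over the seq_len dimension to get the single prediction.
per_step = tf.keras.layers.Dense(1)(attn)        # (batch, seq_len, 1)
output = tf.reduce_sum(per_step, axis=1)          # (batch, 1)
model = tf.keras.Model(inputs, output)
```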
Best Answer
In the attention layer, the required mask has shape `(batch, queries, keys)`, so in order to train the entire horizon (the queries) against the last value (the last key):
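A sketch of such a mask in TensorFlow (shapes are illustrative): every query position is allowed to attend only to the last key, so the boolean mask is `True` only in the final key column.

```python
import tensorflow as tf

batch, seq_len = 32, 10
# Boolean mask of shape (batch, queries, keys): True = may attend.
# Only the last key column is True, so every one of the seq_len
# query steps attends exclusively to the final time step.
last_key_only = tf.concat(
    [
        tf.zeros((batch, seq_len, seq_len - 1), dtype=tf.bool),
        tf.ones((batch, seq_len, 1), dtype=tf.bool),
    ],
    axis=-1,
)
```

This tensor can then be passed as `attention_mask=last_key_only` in the `MultiHeadAttention` call.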