Transformers – How to Properly Mask MultiHeadAttention for Sliding Window Time Series Data

attention, neural-networks, tensorflow, transformers

I have data of shape (batch, seq_len, features) that is a sliding window over a time series. In essence, I'm using the most recent seq_len steps to predict a single target variable. This means that the output at the last of the seq_len positions in my MultiHeadAttention layer should be the predicted value.

I've made many attempts at generating different attention_masks to use in Keras' MultiHeadAttention, but none of them quite capture the behavior I want, inevitably leading to poor results. Ultimately, I only want the importance of each of the seq_len query steps relative to the last key step. It's basically an autoregressive additive model using the transformer architecture (encoder only). The last step is to tf.reduce_sum over the entire seq_len to get the output.
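For context, a minimal sketch of this setup (the head count, layer sizes, and dummy shapes below are placeholders, not my actual configuration):

import tensorflow as tf

# Placeholder sliding-window batch.
batch, seq_len, features = 32, 24, 8
x = tf.random.normal((batch, seq_len, features))

# Encoder-only self-attention over the window; attention_mask is the piece
# I'm trying to get right.
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
attn_out = mha(query=x, value=x, key=x)        # (batch, seq_len, features)

# Per-step contributions, summed over the window to form the additive output.
contrib = tf.keras.layers.Dense(1)(attn_out)   # (batch, seq_len, 1)
pred = tf.reduce_sum(contrib, axis=1)          # (batch, 1)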

Future modifications to the attention layer might include teacher forcing, which should further improve the learning phase and reduce the obvious influence of the last value's correlation with itself, but I can't figure out how to mask correctly in the first place for continuous time series data like this. To be clear, this is NOT an NLP model.

Best Answer

In Keras' MultiHeadAttention, the required attention_mask is of shape (batch, queries, keys), so in order to train the entire horizon (queries) on the last value (last key):

import tensorflow as tf

def timeseries_sliding_window(length):
    """Returns a mask of shape (length, length) where only the last entry of the 
    2nd dimension (keys) is of relevance to all elements of the 1st (queries)
    """
    # Ones in the last key column, zeros elsewhere: every query step attends
    # only to the final time step of the window.
    return tf.concat([tf.zeros((length, length - 1)), tf.ones((length, 1))], axis=1)
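
A usage sketch, assuming the mask is broadcast to the batch dimension before being passed to the layer (the head count, key_dim, and dummy shapes are arbitrary):

import tensorflow as tf

batch, seq_len, features = 32, 24, 8
x = tf.random.normal((batch, seq_len, features))

# Build the (queries, keys) mask and broadcast it to (batch, queries, keys);
# Keras expects a boolean attention_mask.
mask = timeseries_sliding_window(seq_len)
mask = tf.cast(tf.broadcast_to(mask, (batch, seq_len, seq_len)), tf.bool)

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
out = mha(query=x, value=x, key=x, attention_mask=mask)   # (batch, seq_len, features)

With this mask, every query position attends only to the final key step of the window, which matches the intent of training the entire horizon on the last value.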