Neural Networks – Understanding Components of Temporal Fusion Transformer

attention, lstm, neural-networks, temporal-fusion-transformer, transformers

I'm currently reading the paper Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting: https://arxiv.org/abs/1912.09363v3

However, I had to stop at pages 7-8, which describe the different components of the TFT, because I lack knowledge of most of the pieces mentioned there. Hence my questions:

1.) What are the different components I have to understand in order to understand how the TFT works? So far I have: LSTMs (and their gates), Transformers (with encoders/decoders), and multi-head attention blocks. Are there more parts I need in order to understand the TFT, e.g. what about the GRN or the Dense layers? (See my rough GRN sketch below.)

2.) Do you know of any literature that explains these components in the context of time series AND is not too math-heavy?
I normally learn the maths much better from an example.
So far, I have read through a few blogs and watched some tutorials and videos, even one about the paper itself, but they do not focus strongly on the individual components. Furthermore, nearly every tutorial explains parts of the TFT in terms of, for example, a speech-translation problem, and not as a time-series problem.

Although I did some transfer to time series myself, I have no further literature to cross-validate my thoughts against.
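To make question 1 concrete, here is my current reading of the GRN, the component I am least sure about. As far as I can tell, the paper builds it from a dense layer with an ELU activation, a second dense layer, a Gated Linear Unit (GLU), and a residual connection followed by layer normalization. The PyTorch sketch below is just how I pieced this together and may well be wrong; the layer widths are arbitrary, and I left out the optional static context input the paper also feeds in:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualNetwork(nn.Module):
    """Rough sketch of the TFT Gated Residual Network (GRN).

    Simplified: the same width is used everywhere, and the optional
    static context input from the paper is omitted.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_model)      # first dense layer (ELU follows)
        self.fc2 = nn.Linear(d_model, d_model)      # second dense layer
        self.glu = nn.Linear(d_model, 2 * d_model)  # value and gate projections for the GLU
        self.norm = nn.LayerNorm(d_model)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        eta2 = F.elu(self.fc1(a))
        eta1 = self.fc2(eta2)
        # GLU: a sigmoid gate scales a linear projection elementwise, so the
        # network can suppress the whole nonlinear branch when it is not needed.
        value, gate = self.glu(eta1).chunk(2, dim=-1)
        return self.norm(a + value * torch.sigmoid(gate))  # residual + LayerNorm

# Toy usage: a batch of 8 steps with 32-dim features keeps its shape.
grn = GatedResidualNetwork(d_model=32)
print(grn(torch.randn(8, 32)).shape)  # torch.Size([8, 32])
```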

So far I have read/watched:

All the best

Best Answer

First of all, you should understand why the Temporal Fusion Transformer (TFT) is such an awesome model.

The biggest advantages of TFT are versatility and interpretability. In other words, the model works with multiple time series and with all sorts of inputs (even categorical variables!). Also, it is not a black-box model: with the attention weights you can find out which features are important, as well as the dominant seasonal patterns in your dataset! How cool is that?

I recommend reading this article. It explains in depth the different components of TFT and how they work together. A concept that may be difficult to grasp, in case you are not already familiar with it, is the attention mechanism; since TFT is a Transformer-based model, it uses attention (with some extra perks). If you are hearing the term attention for the first time (in the context of deep learning), take a look here: this is the best source on the internet that explains attention, plus it uses illustrations!
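If it helps to have something concrete alongside those links: at its core, Transformer-style attention (TFT included) is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. Here is a minimal PyTorch sketch (the function name, shapes, and toy data are mine, purely for illustration):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Computes softmax(Q K^T / sqrt(d_k)) V and returns the weights too.

    In TFT, these attention weights are exactly what you inspect to see
    which past time steps the model attends to (e.g. seasonal patterns).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, T_q, T_k)
    weights = torch.softmax(scores, dim=-1)        # each row sums to 1
    return weights @ v, weights

# Toy usage: one series, 10 time steps, 16-dim queries/keys/values.
q = k = v = torch.randn(1, 10, 16)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([1, 10, 16]) torch.Size([1, 10, 10])
```

Multi-head attention simply runs several of these in parallel with different learned projections; TFT's interpretable variant shares the value weights across heads so that the averaged attention weights remain meaningful.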

(Disclaimer: I am the author of the first article)
