Thanks for sharing your code. trainNetwork does not accept timeseries objects, so you do not need to convert your numeric sequences into timeseries when preparing your data. Assigning array2d to an element of the cell array X should create an appropriate input for trainNetwork:
X{VideoNumber} = array2d;
Looking at your network architecture, I noticed another issue. The inputSize argument of sequenceInputLayer should not be the number of time steps in your data. Instead, inputSize is the fixed feature dimension of each time step, so it should be 40*200 = 8000 to match your data. Networks with a sequenceInputLayer can accept an arbitrary number of time steps, so if a video had fewer than 2000 frames, the network would still be able to classify it.
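Putting both points together, here is a rough sketch of how the data preparation and the input layer could look. It assumes each video is stored as a 40-by-200-by-numFrames array, so that array2d ends up 8000-by-numFrames (features by time steps). The random data, the shortened frame counts, the labels, the number of classes, and the LSTM size are all placeholders of mine, not taken from your code.

```matlab
% Sketch with random stand-in data; only the 40x200 frame size comes from the question.
rng(0);
numVideos   = 4;
frameCounts = [200 150 200 180];   % shortened, unequal lengths to keep the demo light
numClasses  = 2;                   % assumed number of classes

X = cell(numVideos, 1);
for VideoNumber = 1:numVideos
    video = rand(40, 200, frameCounts(VideoNumber));   % hypothetical video frames
    % Flatten each 40x200 frame into one column of 8000 features,
    % giving an 8000-by-numFrames matrix (features x time steps).
    array2d = reshape(video, 40*200, []);
    X{VideoNumber} = array2d;
end
Y = categorical([1; 2; 1; 2]);     % placeholder labels, one per video

layers = [
    sequenceInputLayer(40*200)            % inputSize = 8000, not the frame count
    lstmLayer(16, 'OutputMode', 'last')   % assumed hidden size; one output per sequence
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', 'MaxEpochs', 1, 'Verbose', false);
net = trainNetwork(X, Y, layers, options);

% A sequence with fewer frames than the others is still classified.
label = classify(net, X{2});
```

Because the time dimension is the second dimension of each cell entry, videos of different lengths can sit in the same cell array without any change to the layers.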