Solved – Use of a shuffled dataset for training and validating LSTM recurrent neural network models

Tags: lstm, machine learning, recurrent neural network, sentiment analysis, time series

I am trying to build a recurrent neural network model using LSTM to predict future outputs of a financial time series. The outputs are classified into macro classes according to the magnitude of their absolute changes.
My inputs are the variable I am trying to predict itself plus some other features.
I build windows of inputs, each one representing a sample, as sketched below. The window size is fixed and equal to the chosen timestep of the model, so each sample is an ordered sequence of feature vectors, one per timestep.
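To be concrete, this is roughly the windowing step I mean (just a minimal sketch in Python; the array shapes and the helper name make_windows are illustrative, not my actual code):

```python
import numpy as np

def make_windows(series: np.ndarray, labels: np.ndarray, timesteps: int):
    """series: (n_obs, n_features); labels: (n_obs,) class codes.
    Returns samples of shape (n_windows, timesteps, n_features) plus their targets."""
    X, y = [], []
    for i in range(len(series) - timesteps):
        X.append(series[i:i + timesteps])   # ordered slice of `timesteps` consecutive rows
        y.append(labels[i + timesteps])     # class of the step right after the window
    return np.asarray(X), np.asarray(y)

# e.g. 1000 observations, 5 features, windows of 30 timesteps:
# X, y = make_windows(raw_series, raw_labels, 30)   # X: (970, 30, 5), y: (970,)
```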
The data preparation approach I am thinking of using is one of the following two:
Approach A
1) split the initial dataset so that the first 80% of observations is my training set and the last 20% is my final test set
2) randomly shuffle the first set and then divide it into two parts, training and validation, for hyperparameter selection. I would also build the training set so that the output classes are equally represented (i.e. say I have three output classes: I drop from the training set the samples associated with the two most frequent labels in excess of the least represented one, so that each label accounts for 33%). Note: the shuffle involves the batches of sequences of course, so each sequence still preserves its own internal order.
3) standardize the three sets (shuffled training set, shuffled validation set, unshuffled ordered test set) using statistics computed on the training set only; a sketch of these steps follows below
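A rough sketch of Approach A in Python (assuming X is the array of windows with shape (n_samples, timesteps, n_features) and y holds integer class labels; the helper names are illustrative only):

```python
import numpy as np

def approach_a(X, y, test_frac=0.2, val_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)

    # 1) chronological split: first 80% for model building, last 20% kept as the final test set
    cut = int(len(X) * (1 - test_frac))
    X_dev, y_dev = X[:cut], y[:cut]
    X_test, y_test = X[cut:], y[cut:]

    # 2) shuffle the development windows (each window keeps its internal order),
    #    then split them into training and validation parts
    idx = rng.permutation(len(X_dev))
    val_cut = int(len(idx) * (1 - val_frac))
    tr_idx, va_idx = idx[:val_cut], idx[val_cut:]
    X_tr, y_tr = X_dev[tr_idx], y_dev[tr_idx]
    X_va, y_va = X_dev[va_idx], y_dev[va_idx]

    #    balance the training classes by downsampling to the least frequent label
    counts = np.bincount(y_tr)
    n_min = counts[counts > 0].min()
    keep = np.concatenate([rng.choice(np.flatnonzero(y_tr == c), n_min, replace=False)
                           for c in np.flatnonzero(counts)])
    X_tr, y_tr = X_tr[keep], y_tr[keep]

    # 3) standardize all three sets with statistics computed on the training set only
    mu = X_tr.mean(axis=(0, 1), keepdims=True)
    sd = X_tr.std(axis=(0, 1), keepdims=True) + 1e-8
    scale = lambda A: (A - mu) / sd
    return scale(X_tr), y_tr, scale(X_va), y_va, scale(X_test), y_test
```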

Approach B
1) shuffle the whole dataset first (again, I mean shuffle the batches of sequences; each one stays ordered internally)
2) split it into three parts, training, validation and test sets, using the same stratification approach described above
3) standardize as in Approach A (sketched below)
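And the corresponding sketch of Approach B, under the same assumptions (the class-balancing step from Approach A would then be applied to the resulting training set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def approach_b(X, y, test_frac=0.2, val_frac=0.2, seed=0):
    # 1) + 2) shuffle all windows first, then make a stratified three-way split
    #    (each window still keeps its own internal time order)
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=test_frac, shuffle=True, stratify=y, random_state=seed)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_dev, y_dev, test_size=val_frac, shuffle=True, stratify=y_dev, random_state=seed)
    # (the class-balancing downsampling from Approach A would be applied to X_tr, y_tr here)

    # 3) standardize as in Approach A, using training-set statistics only
    mu = X_tr.mean(axis=(0, 1), keepdims=True)
    sd = X_tr.std(axis=(0, 1), keepdims=True) + 1e-8
    scale = lambda A: (A - mu) / sd
    return scale(X_tr), y_tr, scale(X_va), y_va, scale(X_test), y_test
```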

Which of the two do you think would be the more robust approach?
Please bear in mind I am not using a stateful kind of model.
My doubt is that, because of the sequential nature of each sample, by shuffling the samples the test set will contain some "déjà-vu" data the network has already seen during learning, thus leading to misleading expectations when tested on future realizations.

Thanks for your opinions

Best Answer

One of the most important problems with ANN models for price change prediction is that there may be many days (minutes, weeks, trading bars) that are redundant, so you may have a large amount of trading data you don't need -- these won't help the ANN. Thus, you can run PCA on the days (bars) as features: get the correlation matrix, extract the "major" eigenvalues >1 (a fast trick), and then use the PC scores for the components associated with those major eigenvalues as pseudo-days. (Thus, when done, you won't even have your original trading data, but rather maybe 500 pseudo-days instead of e.g. 4 x 250 (trading days/year) = 1000 days.)

Assuming your data is a $t \times p$ matrix, where $t$ is the number of days (bars) and $p$ is the number of features, turn the data sideways before running the PCA so that days are the variables (columns) and the original features (assets, stocks, signals) are the rows. Run correlation on the days to get the $t \times t$ correlation matrix $\mathbf{R}$. After the PCA, you will get a $t \times m$ loading matrix $\mathbf{L}$, where $m$ is the number of PCs whose eigenvalues satisfy $\lambda_j>1$. Specify during the run that you want the PC scores (not the "PC score coefficients"); that matrix will be $p \times m$, with one row per original feature and one column per retained component. Call it the $\mathbf{Z}$ matrix, since its columns (one per PC) are distributed $\mathcal{N}(0,1)$. The new data matrix to feed to your ANN is this $\mathbf{Z}$ matrix. This is time consuming, but it is an important step that has been used for 30-40 years with ANNs on time-series price data.
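A bare-bones NumPy sketch of this procedure as described (variable names are mine; note that with the days as columns, the standard PC scores fall out with one row per original feature and one column per retained component):

```python
import numpy as np

def pseudo_days(data: np.ndarray):
    """data: t x p array of t days (bars) by p features."""
    # Turn the data sideways: days become the variables (columns), features the rows (p x t)
    sideways = data.T

    # t x t correlation matrix of the days
    R = np.corrcoef(sideways, rowvar=False)

    # Eigendecomposition of R; keep the "major" components with eigenvalue > 1
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    m = int(np.sum(eigvals > 1.0))

    # Loading matrix L: t x m (loadings of the t day-variables on the m components)
    L = eigvecs[:, :m] * np.sqrt(eigvals[:m])

    # PC scores Z: standardized sideways data projected onto the components and rescaled
    # so each score column has unit variance; shape is p x m (features by pseudo-days)
    std_sideways = (sideways - sideways.mean(axis=0)) / sideways.std(axis=0, ddof=1)
    Z = std_sideways @ eigvecs[:, :m] / np.sqrt(eigvals[:m])
    return L, Z
```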

Another important issue with ANNs is that correlation between input features wastes learning time, since the ANN will spend capacity learning the correlation, which you don't want. That is why it is common to run PCA on the input features first and feed the network the PCs, which are orthogonal, i.e., have zero correlation with one another. However, you first have to deal with the problem of wasted (redundant) days in the input data, mainly because they don't help the learning process.
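For completeness, a minimal scikit-learn sketch of this more common step, PCA on the input features themselves, fitted on the training portion only and using the same eigenvalue > 1 heuristic as above (the function name and split are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def decorrelate_features(train: np.ndarray, test: np.ndarray):
    """train/test: (n_days, n_features). Returns orthogonal PC inputs for the ANN."""
    scaler = StandardScaler().fit(train)            # fit on training data only (no look-ahead)
    pca = PCA().fit(scaler.transform(train))
    m = int(np.sum(pca.explained_variance_ > 1.0))  # same eigenvalue > 1 heuristic as above
    tr = pca.transform(scaler.transform(train))[:, :m]
    te = pca.transform(scaler.transform(test))[:, :m]
    return tr, te
```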

See the DDR package from Jurik Research.