Solved – Dealing with very large time-series datasets

feature-selection, feature-engineering, large-data, machine-learning

I have access to a very large dataset. The data is from MEG recordings of people listening to musical excerpts drawn from one of four genres. The data is structured as follows:

  • 6 Subjects
  • 3 Experimental repetitions (epochs)
  • 120 Trials per epoch
  • 8 seconds of data per trial at 500Hz (=4000 samples) from 275 MEG channels

So each "example" here is a matrix of size [4000×275], and there are 2160 of such examples, and that's before any feature extraction. The goal is to predict genre based on the brain signal (4-class classification).

Clearly there are some challenging issues here, namely:

  1. The dataset does not fit in memory
  2. There will be strong temporal correlations in the data, and the inter-subject variation will be huge. As a result it's not obvious how to split the data
  3. The signal-to-noise ratio is very low
  4. It's not obvious what the correct features for a classifier would be

Taking these in turn:

  1. There are various things one can do. Firstly, we can safely downsample from 500Hz to ~200Hz, since even taking the Nyquist limit into account, brain activity doesn't really occur above 100Hz. We could also subsample from the set of channels (e.g. centre over auditory areas), but we'd rather not do this a priori, as there may be activity in other areas (frontal etc.) that could be of interest. We can probably also drop a portion of the time window. Perhaps only the first 2s is important for the task? It's not really known. Of course everyone will shout "Dimensionality reduction!", but that's not trivial either. Firstly, we'd have to be very careful about our train/test splits (see 2.), and it's also not obvious whether to do this before or after feature generation. Secondly, other than expensive cross-validation or painstaking visual inspection, there's no obvious way to select either the appropriate method or the appropriate number of dimensions. We could of course just use e.g. PCA, ICA, or random projections and hope for the best (a sketch of the downsample-then-reduce idea follows this list).

  2. This is tricky. If we have successive samples in the training set we are likely to overfit it, whereas if successive samples are split between the train and test sets we're likely to underfit the training set, but could still overfit the test set. There seem to be various options here (a splitting sketch follows this list):

    • Single-subject classification. Take each individual subject on their own and split according to epochs. This should be the easiest task, as we're not trying to predict across brains. Within this, one could use the two remaining epochs for cross-validation; for completeness one should rotate through all combinations. We would simply report the average accuracy over all subjects. Of course we would not expect these models to generalise well at all.
    • Within-subjects classification. Take all of the subjects together, and split according to epochs. This may in fact be the easiest task, as we will have seen all of the subjects in training; however, we probably wouldn't expect the models to generalise well to new subjects. Within this, one could use the two remaining epochs for cross-validation; for completeness one should rotate through all combinations.
    • Between-subjects classification. Also known as "leave-one-subject-out": a single subject is taken as test data and the rest are used for training, rotating through all subjects, so cross-validation is performed over subjects. We would expect this to be a much more difficult task, as we are trying to predict on a "new brain" each time. Here we would expect the models to generalise well to the larger population, although there is an issue of test-retest reliability (i.e. how much overfitting is caused by temporal correlations).
  3. This is a classical "needle in a haystack" problem – the actual signal relating to the recognition of musical genre, or any genre-specific processing, is likely to be minuscule compared to the "soup" of activity in the brain. There are also notable artefacts which can only be partially removed (mainly related to movement). Any features that we derive from the data, and any ways in which the data is treated, should avoid destroying part of the signal of interest.

  4. Here one could imagine doing various things. The first would be simply to use the raw data (concatenated into a vector) as the feature vector, though I'm not sure how fruitful that is; I suspect these vectors would be essentially indistinguishable from noise. This is then really a signal-processing question, but there are some general guidelines one can follow. One is to do standard Fourier analysis over a sliding window, from which the components can be split into distinct frequency bands (alpha/beta/gamma etc.), and statistics of these (mean, std. deviation) can be used as features (see the band-power sketch after this list). Or one could use wavelets, Hilbert transforms, or even attempt to look for chaotic attractors. Of course we then also have the choice of kernels (linear, polynomial, RBF etc.), which multiplies up the number of permutations. Perhaps the best thing to do here is to generate as many different feature sets as possible, and then use MKL or boosting methods to combine them.
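
To make point 1 concrete, here is a minimal sketch of the downsample-then-reduce idea using scipy and scikit-learn's IncrementalPCA, which never needs the full dataset in memory. The `iter_trial_batches` loader, the `meg_trials.npy` file name, and the storage layout are my assumptions standing in for however the trials are actually stored on disk; the 500Hz→200Hz resampling, the 2s crop, and the choice of 50 components follow the numbers discussed above but are not prescribed.

```python
import numpy as np
from scipy.signal import resample_poly
from sklearn.decomposition import IncrementalPCA

FS_IN, FS_OUT = 500, 200   # downsample 500 Hz -> 200 Hz
CROP_SECONDS = 2           # keep only the first 2 s of each trial

def iter_trial_batches(path="meg_trials.npy", batch_size=120):
    """Yield batches of raw trials from a memory-mapped array on disk.
    The file name and layout (2160, 4000, 275) are assumptions about how
    the data might be stored; only one batch is ever held in RAM."""
    data = np.load(path, mmap_mode="r")
    for start in range(0, len(data), batch_size):
        yield np.asarray(data[start:start + batch_size], dtype=np.float32)

def preprocess(trials):
    """(n_trials, 4000, 275) raw trials -> flattened, reduced-rate vectors."""
    trials = trials[:, : CROP_SECONDS * FS_IN, :]           # crop to 2 s
    trials = resample_poly(trials, FS_OUT, FS_IN, axis=1)   # 1000 -> 400 samples
    return trials.reshape(len(trials), -1)                  # (n_trials, 400 * 275)

# First pass: fit the projection incrementally, batch by batch.
ipca = IncrementalPCA(n_components=50)
for batch in iter_trial_batches():
    ipca.partial_fit(preprocess(batch))

# Second pass: project every batch into the 50-dimensional space.
reduced = np.vstack([ipca.transform(preprocess(b)) for b in iter_trial_batches()])
```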
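
For the splitting schemes in point 2, scikit-learn's group-aware splitters map onto the three options fairly directly. A minimal sketch, assuming per-trial `subject` and `epoch` labels are available; the arrays `X` and `y` here are random placeholders for whatever comes out of the reduction/feature steps, and the classifier choice is arbitrary:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression

# Placeholder data: X is (2160, n_features), y holds the 4 genre labels,
# subject/epoch record which subject (0-5) and repetition (0-2) each trial
# came from, in the 6 x 3 x 120 layout described above.
rng = np.random.default_rng(0)
X = rng.standard_normal((2160, 50))
y = rng.integers(0, 4, size=2160)
subject = np.repeat(np.arange(6), 360)           # 6 subjects x 360 trials each
epoch = np.tile(np.repeat(np.arange(3), 120), 6) # 3 epochs x 120 trials, per subject

clf = LogisticRegression(max_iter=1000)
logo = LeaveOneGroupOut()

# Within-subjects: pool all subjects, hold out one epoch at a time (3 folds).
within = cross_val_score(clf, X, y, cv=logo, groups=epoch)

# Between-subjects: hold out one whole subject at a time (6 folds).
between = cross_val_score(clf, X, y, cv=logo, groups=subject)

# Single-subject: restrict to one subject, then split by epoch as above.
mask = subject == 0
single = cross_val_score(clf, X[mask], y[mask], cv=logo, groups=epoch[mask])
```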
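
And for point 4, a minimal sketch of the frequency-band feature idea: short-time FFTs over a sliding window, power summed within the classical bands, then mean and standard deviation over windows per channel. The band edges and window length are common conventions I've assumed, not values dictated by this dataset.

```python
import numpy as np
from scipy.signal import stft

FS = 200  # sampling rate after downsampling (Hz)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 80)}

def bandpower_features(trial, fs=FS, window_sec=1.0):
    """trial: (n_samples, n_channels) -> vector of band-power statistics.

    For each channel and frequency band, take the mean and std of the
    band's power across sliding windows (50% overlap, stft's default)."""
    nperseg = int(window_sec * fs)
    # f: (n_freqs,), Z: (n_channels, n_freqs, n_windows)
    f, _, Z = stft(trial.T, fs=fs, nperseg=nperseg)
    power = np.abs(Z) ** 2
    feats = []
    for lo, hi in BANDS.values():
        band = (f >= lo) & (f < hi)
        band_power = power[:, band, :].sum(axis=1)   # (n_channels, n_windows)
        feats.append(band_power.mean(axis=1))        # mean over windows
        feats.append(band_power.std(axis=1))         # std over windows
    return np.concatenate(feats)   # 275 channels x 5 bands x 2 stats = 2750

# Example: one synthetic trial of 2 s at 200 Hz across 275 channels.
trial = np.random.default_rng(0).standard_normal((400, 275))
print(bandpower_features(trial).shape)   # (2750,)
```

These per-trial vectors (or several variants of them) are the sort of feature sets one could then feed to MKL or boosting as suggested above.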

How would you approach this kind of dataset (if not this one specifically)? Is there anything I've missed along the way? What is the most likely strategy to succeed, without spending endless amounts of researcher time and computational resources?

Best Answer

@tdc: All of the issues you mention here regarding the analysis of neuroscience data, including dimensionality reduction, within/between-subjects classification, signal-to-noise ratio, and more, are handled by the EEGLAB toolbox, which was specifically designed for this kind of neuroscience data:

EEGLAB is an interactive Matlab toolbox for processing continuous and event-related EEG, MEG and other electrophysiological data incorporating independent component analysis (ICA), time/frequency analysis, artifact rejection, event-related statistics, and several useful modes of visualization of the averaged and single-trial data.

Thus, with regard to your question "What is the most likely strategy to succeed, without spending endless amounts of researcher time?", I would encourage you to watch the EEGLAB online workshop and continue from there...

Update: for more machine-learning functionality, take a look at the (new) BCILAB toolbox.
