An approach to visualise/investigate multi-dimensional data

data visualization, time series

I'm attempting to analyse a large body of empirical measurements subject to two parametrised transformations. In essence, each function takes three 'count' parameters and returns a sequence of floats for each configuration. I'm expecting (hoping) to see some interesting patterns emerge when appropriate parameters are selected. I anticipate that the patterns might be relative between the sequences returned by each function – and/or relate to patterns of some kind in the parameters. In case it's relevant, the three 'count' parameters roughly correspond to:

  • 'window size' on the underlying data over which summary statistics are calculated
  • A number of consecutive windows used to compute a single summary statistic (i.e. the trade-off between greater spatial and greater temporal accuracy)
  • A 'minimum age' – an offset into the history of the underlying data.

The summary statistics are non-trivial but are independently sensitive to all three parameters.

I'm interested in visualisation techniques – suited to ad-hoc enquiry – that will help me experiment with this multi-dimensional data.

EDIT (In response to Peter Ellis): Your 'pun' is, essentially, my problem. I've not stored the data – instead, I can calculate it 'on the fly' from (otherwise opaque) bulk empirical data. If I were to store the results of my calculation in matrix form, it would be a 4D matrix… The first three dimensions are all temporal – as identified above – then the index into the resulting sequence is also temporal, and the values are a dimensionless, problem-specific scalar metric represented as a floating-point number. One might imagine the data to visualise as associating a distribution (function) with each 'voxel' in a cuboid. I visualise these sequences (representing the distribution associated with each voxel) in isolation, as regularly sampled continuous functions.

One way I've imagined the data is with the sequences as line graphs, and three 'tweaking knobs' I can use to 'tune in' a result… Another is as any one of three surface plots with two 'tweaking knobs'. A more elaborate visualisation animates a range of values for one of the 'tweaking knobs', leaving me one parameter to set manually.

The structure of the data does not suggest which parameters are best in which context, and the answer may depend upon both the empirical data (yet to be collected in full) and the scale of the parameters. A significant complication is that I don't – as yet – know how sensitive the results of my calculations will be to the parameters above… One objective is to find values for the first two parameters that minimise the impact of the third. Beyond that, and the analogies between my parameters and time, I'm afraid the data is pretty abstract… there's no real-world object from which to draw inspiration.
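For what it's worth, here is a minimal sketch of the 'three tweaking knobs' idea with matplotlib sliders: one slider per parameter, redrawing a line graph of the returned sequence on each change. The `compute_sequence` function is a hypothetical stand-in for my real on-the-fly calculation, and the parameter ranges are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

# Hypothetical stand-in for the real on-the-fly calculation: given the
# three 'count' parameters, return the metric sequence for that configuration.
def compute_sequence(window, n_windows, min_age, length=200):
    t = np.arange(length)
    rng = np.random.default_rng(1000 * window + 100 * n_windows + min_age)
    return np.sin(t / (window + 1)) / (n_windows + 1) + rng.normal(0, 0.05, length)

fig, ax = plt.subplots()
fig.subplots_adjust(bottom=0.30)  # leave room for the three sliders
(line,) = ax.plot(compute_sequence(5, 3, 0))

# One slider per 'tweaking knob'.
slider_axes = [fig.add_axes([0.15, y, 0.7, 0.03]) for y in (0.05, 0.12, 0.19)]
sliders = [
    Slider(slider_axes[0], 'window size', 1, 50, valinit=5, valstep=1),
    Slider(slider_axes[1], 'n windows', 1, 20, valinit=3, valstep=1),
    Slider(slider_axes[2], 'min age', 0, 100, valinit=0, valstep=1),
]

def update(_):
    # Recompute and redraw the sequence for the current knob settings.
    line.set_ydata(compute_sequence(*(int(s.val) for s in sliders)))
    ax.relim()
    ax.autoscale_view()
    fig.canvas.draw_idle()

for s in sliders:
    s.on_changed(update)
plt.show()
```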

Best Answer

Your data seems to be of the form $u=f(x,y,z,t)$, i.e., a time series for each point in space, where the space coordinates are window size, number of windows and offset. This can either be seen as a 4-dimensional array (the function $f$) or a set of points $(x,y,z,t,u)$ in a 5-dimensional space (the graph of $f$).
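For concreteness, here is a short NumPy sketch materialising both views. The `compute_sequence` function and the grid sizes are hypothetical placeholders for your actual calculation and parameter ranges.

```python
import numpy as np

# Hypothetical stand-in for the real calculation: one length-T metric
# sequence per grid point (x, y, z).
def compute_sequence(x, y, z, T=100):
    t = np.arange(T)
    return np.sin(t / (x + 1)) * np.exp(-z / 50.0) / (y + 1)

nx, ny, nz, T = 10, 8, 6, 100

# The 4-dimensional array view: u[x, y, z, t] = f(x, y, z, t).
u = np.empty((nx, ny, nz, T))
for x in range(nx):
    for y in range(ny):
        for z in range(nz):
            u[x, y, z] = compute_sequence(x, y, z, T)

# The 5-dimensional point-cloud view: one row (x, y, z, t, u) per sample.
grid = np.stack(
    np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz), np.arange(T), indexing='ij'),
    axis=-1,
)
points = np.concatenate([grid.reshape(-1, 4), u.reshape(-1, 1)], axis=1)
```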

Here are a few ideas to visualize high-dimensional datasets.

The "grand tour" (available in applications such as ggobi) is an animation that shows the cloud of points rotating in space, i.e., several, more or less random, projections of the 5-dimensional space into the plane. Since, for this dataset, the first four coordinates $(x,y,z,t)$ are arranged in a grid, you would just see that grid.

Parallel coordinate plots and general dimension-reduction methods (PCA, MDS) are likely to present the same problem, because of the presence of the grid: the data really is 4-dimensional.

You may be able to adapt some of the plots described in J. Klemelä's book, Smoothing of Multivariate Data (they are designed for densities, but should also work for functions defined on a grid, as here), but they are not very standard, and understanding what they actually mean takes a long, long time.

You could slice the data: take points $(x,y,z)$ at random and plot the corresponding time series. You may be able to group them into different patterns (some could be increasing, others decreasing, others could present a bump; some could be noisy, some smooth, etc.), either manually or using some clustering algorithm.
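A minimal sketch of this slicing-and-clustering idea, using k-means on the normalised series as one possible (and certainly not the only) clustering choice; the random-walk array stands in for your actual data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Placeholder for the real data: a (nx, ny, nz, T) array u = f(x, y, z, t).
rng = np.random.default_rng(0)
nx, ny, nz, T = 10, 8, 6, 100
u = rng.normal(size=(nx, ny, nz, T)).cumsum(axis=-1)  # random-walk stand-in

# Slice: draw random voxels (x, y, z) and collect their time series.
idx = rng.integers([0, 0, 0], [nx, ny, nz], size=(30, 3))
series = np.array([u[x, y, z] for x, y, z in idx])

# Group the series into rough pattern classes with k-means on the
# normalised shapes (increasing, decreasing, bumpy, ...).
norm = (series - series.mean(1, keepdims=True)) / (series.std(1, keepdims=True) + 1e-9)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(norm)

# One panel per cluster, overplotting its member series.
fig, axes = plt.subplots(2, 2, sharex=True, figsize=(8, 6))
for k, ax in enumerate(axes.flat):
    for s in series[labels == k]:
        ax.plot(s, alpha=0.5)
    ax.set_title(f'cluster {k}')
plt.tight_layout()
plt.show()
```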

You could aggregate the data in the time dimension: for each point $(x,y,z)$, you could compute some "metrics" of the corresponding time series, e.g., maximum, minimum, average, range, absolute variation, etc. Each of those could be visualized as a 3-dimensional contour plot.
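A sketch of the time-aggregation idea, using the range as the per-voxel metric and one filled-contour panel per $z$ slice to approximate a 3-dimensional contour view (again on placeholder data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder for the real data: a (nx, ny, nz, T) array u = f(x, y, z, t).
rng = np.random.default_rng(0)
nx, ny, nz, T = 10, 8, 6, 100
u = rng.normal(size=(nx, ny, nz, T)).cumsum(axis=-1)

# Collapse the time axis into one scalar per voxel, here the range;
# swap in max, min, average, total absolute variation, etc. as needed.
metric = u.max(axis=-1) - u.min(axis=-1)  # shape (nx, ny, nz)

# One filled-contour panel per z slice.
fig, axes = plt.subplots(1, nz, figsize=(3 * nz, 3), sharey=True)
for z, ax in enumerate(axes):
    cs = ax.contourf(metric[:, :, z].T)
    ax.set_title(f'z = {z}')
    ax.set_xlabel('x')
axes[0].set_ylabel('y')
fig.colorbar(cs, ax=axes, shrink=0.8)
plt.show()
```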

You could aggregate the data in the space dimensions: plot $\sum_{x,y,z} f(x,y,z,t)$ versus $t$ (a single curve) or $\sum_{x,y} f(x,y,z,t)$ versus $t$ for all values of $z$ (many curves, either on the same plot or on different plots). You could replace the sum with the average, the median, the standard deviation, etc.
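Both spatial aggregations in a short sketch, again on placeholder data; replace the sums with means, medians, etc. as suggested above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder for the real data: a (nx, ny, nz, T) array u = f(x, y, z, t).
rng = np.random.default_rng(0)
nx, ny, nz, T = 10, 8, 6, 100
u = rng.normal(size=(nx, ny, nz, T)).cumsum(axis=-1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Sum over all of space: a single curve in t.
ax1.plot(u.sum(axis=(0, 1, 2)))
ax1.set_title(r'$\sum_{x,y,z} f(x,y,z,t)$ vs $t$')

# Sum over x and y only: one curve per value of z.
for z in range(nz):
    ax2.plot(u[:, :, z, :].sum(axis=(0, 1)), label=f'z = {z}')
ax2.set_title(r'$\sum_{x,y} f(x,y,z,t)$ vs $t$')
ax2.legend(fontsize='small')
plt.tight_layout()
plt.show()
```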
