Solved – Gaussian Process for Data Imputation

data-imputation, gaussian-process, stan, time-series

I recently came across Gaussian processes in Gelman et al. (2013), and I am trying to learn more about their potential application to imputing time series data. The data of interest is a single-variable time series of an individual's heart rate collected using a photoplethysmogram (PPG; an optic sensor that attaches to the end of a person's finger and measures changes in blood volume).

The problem is that we have certain sections of data that are messy. Existing editing strategies have been developed for dealing with these artifacts, but they were optimized largely on data collected from EKG sensors. The slow waveform of the PPG makes their application to our data a little clunky at times.

Briefly, here is an example, taken from the R Shiny app I have built to improve manual editing of our data, of an isolated messy section surrounded by good signal:

[screenshot: isolated messy section surrounded by good signal]

The light grey line represents the original signal (downsampled from 2 kHz to 100 Hz). The solid black line with the red points shows the interbeat intervals (the time in seconds between successive heartbeats) plotted over time. The interbeat intervals will be the primary variable in any analysis of these data.

For instance, using an individual's interbeat intervals we can assess their heart rate variability. Unfortunately, most editing strategies tend to drive down variability. Moreover, there are certain tasks during which these artifacts are more likely to be present (because of participant movement), meaning I cannot simply mark these messy sections for removal and treat them as missing at random.

The upside is that we do know a great deal about the properties of heart rate. For instance, adults generally range between 60 and 100 BPM at rest. We also know that heart rate varies as a function of the respiration cycle, which itself has a known range of likely frequencies at rest. Finally, we know that there is a low-frequency cycle that influences heart rate variability (thought to reflect a combination of sympathetic and parasympathetic influences on heart rate).
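To put rough numbers on this: 60 to 100 BPM corresponds to a fundamental frequency of roughly 1.0 to 1.7 Hz, resting respiration typically falls around 0.2 to 0.3 Hz, and the low-frequency heart rate variability band is conventionally placed at roughly 0.04 to 0.15 Hz, so there are fairly narrow ranges to build priors around.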

The relatively small section of "bad data" depicted above is actually not my primary concern. I have developed some reasonably accurate seasonal interpolation approaches that seem to work well in these sorts of isolated cases.

My issues arise more when dealing with sections in which bad signal is regularly interspersed with good:

[screenshot: section with bad signal regularly interspersed with good]

As I understand from Gelman et al. (2013), it seems possible to specify several different covariance functions for Gaussian processes. These covariance functions can be informed by the observed data and reasonably well known prior distributions for measures of adult (or child) cardiac and respiratory output.

For instance, say I have some observed average heart rate ($f_{HR}$); I could then specify a Gaussian process governed by that average heart rate as follows (and please let me know if I am off on the math here, as this is my first time trying to apply these models):

$$g_1(t) \sim GP(0, k_1)$$

where

$$k_1(t, t') = \sigma_1^2\exp\Bigg(-\frac{2\sin^2\big(\frac{\pi(t-t')f_{HR}}{Hz}\big)}{2l_1^2}\Bigg)$$

where $Hz$ is my sampling rate and $t$ is my index for time.
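To make sure I have the mechanics right, here is how I would write this kernel as a user-defined Stan function. This is only a sketch: the function name and arguments are mine, nothing here comes from a library, and $f_{HR}$ and the sampling rate are treated as known values.

```stan
functions {
  // Periodic covariance between two time indices t1 and t2 (in samples),
  // following k_1 above: sigma^2 * exp(-2 * sin^2(pi * (t1 - t2) * f / Hz) / (2 * l^2)).
  // All names (periodic_cov, f, Hz, l) are illustrative only.
  real periodic_cov(real t1, real t2, real sigma, real l, real f, real Hz) {
    real s = sin(pi() * (t1 - t2) * f / Hz);
    return square(sigma) * exp(-2 * square(s) / (2 * square(l)));
  }
}
```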

Based on the example Gelman et al. (2013) provide in their text, it seems possible to modify this covariance function to allow for variation over certain periods. In my case, I would like to allow for variation in estimates of $f_{HR}$ within the respiration cycle and within the low-frequency heart rate variability cycle I mentioned above.

To accomplish the first goal, as I understand it, I would specify a Gaussian process and covariance function for respiration rate ($f_R$), along with a Gaussian process whose covariance function incorporates features of both processes:

$$g_2(t) \sim GP(0, k_2)$$

where

$$k_2(t, t') = \sigma_2^2\exp\Bigg(-\frac{2\sin^2\big(\frac{\pi(t-t')f_{R}}{Hz}\big)}{2l_2^2}\Bigg)$$

and…

$$g_3(t) \sim GP(0, k_3)$$

where

$$k_3(t, t') = \sigma_3^2\exp\Bigg(-\frac{2\sin^2\big(\frac{\pi(t-t')f_{HR}}{Hz}\big)}{l_{1,3}^2}\Bigg)\exp\Bigg(-\frac{2\sin^2\big(\frac{\pi(t-t')f_{R}}{Hz}\big)}{2l_{2,3}^2}\Bigg)$$

If I were to stop at this point then my model would be something like:

$$y(t)=g_1(t)+g_2(t)+g_3(t)+\epsilon_t$$
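If I am reading Gelman et al. (2013) correctly, a sum of independent zero-mean Gaussian processes is itself a Gaussian process whose covariance function is the sum of the component kernels:

$$g_1(t)+g_2(t)+g_3(t) \sim GP(0,\; k_1+k_2+k_3),$$

so in practice a single covariance matrix $K = K_1 + K_2 + K_3 + \sigma_\epsilon^2 I$ can be built over the observed (and to-be-imputed) time points.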


EDITS/UPDATES:

First, a more correct specification of my three Gaussian processes, beginning with the squared exponential covariance kernel:

$$g_1(t) \sim N(0, k_1)$$

$$k_1(t,t')=\sigma_1^2\exp\Bigg(-\frac{(t-t')^2}{2l_1^2}\Bigg)$$
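(If I am not mistaken, this is exactly the kernel computed by Stan's built-in `cov_exp_quad(x, sigma_1, l_1)`, which should be both faster and less error-prone than assembling the matrix by hand.)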

Next, a covariance function that incorporates a quasi-periodic pattern based on the individual's heart rate:

$$g_2(t) \sim N(0, k_2)$$

$$k_2(t, t') = \sigma_2^2\exp\Bigg(-\frac{2\sin^2\big(\pi(t-t')f_{HR}\big)}{l_2^2}\Bigg)\exp\Bigg(-\frac{(t-t')^2}{2l_3^2}\Bigg)$$

And finally a covariance function that models heart rate variation as a function of the respiration cycle:

$$g_3(t) \sim N(0, k_3)$$

$$k_3(t, t') = \sigma_3^2\exp\Bigg(-\frac{2\sin^2\big(\pi(t-t')f_{HR}\big)}{l_4^2}\Bigg)\exp\Bigg(-\frac{2\sin^2\big(\pi(t-t')f_{R}\big)}{l_5^2}\Bigg)$$
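For concreteness, here is a stripped-down sketch of how these three kernels could be combined in a single Stan model. This is only a sketch rather than my full implementation: $f_{HR}$ and $f_R$ are passed in as fixed data, the interbeat series is assumed to be centered, and all names and priors are illustrative rather than recommendations.

```stan
data {
  int<lower=1> N;        // number of observed time points
  vector[N] t;           // time (in seconds)
  vector[N] y;           // observed interbeat intervals, centered
  real<lower=0> f_HR;    // assumed heart rate frequency (fixed here for simplicity)
  real<lower=0> f_R;     // assumed respiration frequency (fixed here for simplicity)
}
parameters {
  real<lower=0> sigma1;  // marginal SDs of the three GP components
  real<lower=0> sigma2;
  real<lower=0> sigma3;
  real<lower=0> l1;      // length-scales l_1 ... l_5 from the kernels above
  real<lower=0> l2;
  real<lower=0> l3;
  real<lower=0> l4;
  real<lower=0> l5;
  real<lower=0> sigma_n; // observation noise SD
}
model {
  matrix[N, N] K;

  // Build K = k_1 + k_2 + k_3 element by element, following the kernels above.
  for (i in 1:N) {
    for (j in i:N) {
      real d = t[i] - t[j];
      real k1 = square(sigma1) * exp(-square(d) / (2 * square(l1)));
      real k2 = square(sigma2)
                * exp(-2 * square(sin(pi() * d * f_HR)) / square(l2))
                * exp(-square(d) / (2 * square(l3)));
      real k3 = square(sigma3)
                * exp(-2 * square(sin(pi() * d * f_HR)) / square(l4))
                * exp(-2 * square(sin(pi() * d * f_R)) / square(l5));
      K[i, j] = k1 + k2 + k3;
      K[j, i] = K[i, j];
    }
    K[i, i] = K[i, i] + square(sigma_n);  // add observation noise on the diagonal
  }

  // Weakly informative priors; placeholder values only.
  sigma1 ~ normal(0, 1);
  sigma2 ~ normal(0, 1);
  sigma3 ~ normal(0, 1);
  l1 ~ normal(0, 1);
  l2 ~ normal(0, 1);
  l3 ~ normal(0, 1);
  l4 ~ normal(0, 1);
  l5 ~ normal(0, 1);
  sigma_n ~ normal(0, 0.5);

  y ~ multi_normal(rep_vector(0, N), K);
}
```

Each evaluation of `multi_normal` with a dense $N \times N$ covariance is O($N^3$), which is presumably where most of my run time is going; the usual recommendations are to add a small jitter to the diagonal, use `cholesky_decompose` with `multi_normal_cholesky`, and keep $N$ as small as the downsampling allows.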


OLD QUESTIONS (with updated answers in italics):

1) Knowing little about Gaussian processes, does this seem like a defensible application? They seem highly flexible and appear to have a number of desirable properties that could address my problem of retaining as much true variability in my data as possible, but I have only recently come across them and want to make sure I am not going to be disappointed by going down this rabbit hole.

*My answer so far is that no, this has not been a rabbit hole, or at least not an unproductive one. I have come closer with these models than with any other to recovering "true" values in messy signal. I do wish I could figure out a way to reduce run time, however, as these models are computationally expensive to estimate. While I have made progress here (side note: using Stan and rstan), I still have a ways to go.*

2) Am I understanding and representing the basic features of the covariance functions correctly, and, more importantly, is my attempt at allowing variation in $f_{HR}$ as a function of $f_R$ correct (i.e., function $g_3(t)$)?

*I believe the kernels specified above are consistent with my primary modeling aims. That being said, there may be ways to reduce the complexity that is certainly contributing to my overly long run times. This remains an area I am actively pursuing as I attempt to optimize my models now that I have the basics down.*

3) More technically speaking, how does one go about identifying values for $l$? And what, precisely, does $\sigma^2$ represent in each covariance function? It too seems like something I am going to have to specify a prior distribution for, but I don't fully understand what it represents in this case. Presumably a variance of some sort…

*Right now I am estimating these values from the data, which is almost certainly a factor in my time to convergence. If I could either fix these values or severely restrict their prior distributions so as to (defensibly) shrink the parameter space covered by the models, I think it would go a long way toward improving run time.*
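*To make this concrete, this is the sort of tightened prior I have in mind, swapped in for the placeholder priors in the sketch above (the specific values are for illustration only):*

```stan
  // Inside the model block of the sketch above: tighter hyperparameter priors.
  // The inverse-gamma follows the general advice in the "Robust Gaussian
  // processes in Stan" case study listed below, as I understand it: it pushes
  // the length-scale away from values smaller than the sampling resolution
  // or larger than the span of the data.
  l1 ~ inv_gamma(5, 5);
  l2 ~ lognormal(0, 0.25);   // concentrates l2 near 1 on the kernel's own scale
  sigma1 ~ normal(0, 0.5);   // half-normal, given the <lower=0> declaration
```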

4) I have found an additional source on Gaussian processes that I am starting to work through (Rasmussen & Williams, 2006). Are there any other recommended resources out there I should look into to gain a better understanding of these models?

*I have found a number of additional sources useful in my continued pursuit of a final modeling strategy. See below.*

Fast methods for training Gaussian processes on large datasets – Moore et al., 2016

Fast Gaussian process models in Stan – Nate Lemoine

Even faster Gaussian processes in Stan – Nate Lemoine

Robust Gaussian processes in Stan – Michael Betancourt

Hierarchical Gaussian processes in Stan – Trangucci, 2016


NEW QUESTIONS

  1. Is there a defensible way to constrain the length-scale parameters in the model (the $l$'s) based on the frequencies I am working with (1.5 Hz, 0.25 Hz, and the x-axis in seconds downsampled to 10 Hz)?

  2. What factors should I focus on to speed up modeling time? I know part of it is optimizing my Stan code, but is there anything else I could be doing or changing about the parameterization of my model?


RESULTS TO DATE

Here is the best result to date (in a heavily downsampled data set). The red line represents the "true" signal. The blue is the model-estimated signal for the same period:

[plot: "true" signal (red) vs. model-estimated signal (blue)]

Note: I would be largely satisfied with this result if I can speed up its estimation and smooth it out a bit.


Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). New York: CRC Press.

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.

Best Answer

I cannot answer all of your questions, and I am unable to post a comment due to lack of reputation, but in response to this:

> Is there a defensible way to constrain the length-scale parameters in the model (the $l$'s) based on the frequencies I am working with (1.5 Hz, 0.25 Hz, and the x-axis in seconds downsampled to 10 Hz)?

You may want to look at a spectral mixture kernel. This basically involves representing your covariance function by its Fourier transform (its spectral density), and thus working in frequencies rather than distances. This may more naturally encode some of your prior information.
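In case it helps, the one-dimensional spectral mixture kernel (Wilson & Adams, 2013) has, if I recall the form correctly,

$$k_{SM}(t-t') = \sum_{q=1}^{Q} w_q \exp\big(-2\pi^2 (t-t')^2 v_q\big)\cos\big(2\pi (t-t')\mu_q\big),$$

where each component has a weight $w_q$, a mean frequency $\mu_q$, and a frequency variance $v_q$. Placing the $\mu_q$ (or their priors) near your 1.5 Hz and 0.25 Hz bands would be one direct way to encode that information.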

You might find this page useful if this sounds interesting, and the original thesis is here.
