The documentation for BSTS says the following about coefficients:
If the object contains a regression component, then the output contains a matrix with rows corresponding to coefficients and columns corresponding to:
The posterior probability the variable is included.
The posterior probability that the variable is positive.
The conditional expectation of the coefficient, given inclusion.
The conditional standard deviation of the coefficient, given inclusion.
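For concreteness, those four summaries can be recovered from the matrix of MCMC draws of the regression coefficients (rows are iterations, columns are predictors). The sketch below is only an illustration of that relationship, not the package's own summary code; the function name, the burn-in handling, and the treatment of never-included variables are my own choices.

```r
## Sketch: compute the four documented summaries from a draws matrix
## (rows = MCMC iterations, columns = predictors), e.g. model$coefficients.
coef_summary <- function(draws, burn = 0) {
  if (burn > 0) draws <- draws[-seq_len(burn), , drop = FALSE]
  included <- draws != 0                                    # spike-and-slab: zero draw = excluded
  inc_prob <- colMeans(included)                            # P(variable included)
  prob_pos <- colMeans(draws > 0)                           # P(coefficient is positive)
  mean_inc <- colSums(draws) / pmax(colSums(included), 1)   # E[beta | included]
  sd_inc   <- sapply(seq_len(ncol(draws)), function(j) {
    x <- draws[included[, j], j]
    if (length(x) > 1) sd(x) else NA_real_                  # SD[beta | included]
  })
  cbind(inc.prob = inc_prob, prob.positive = prob_pos,
        mean.given.inclusion = mean_inc, sd.given.inclusion = sd_inc)
}
## usage (assuming a fitted bsts object called model):
## coef_summary(model$coefficients, burn = 100)
```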
There are several examples online of researchers using BSTS and showing a plot similar to the one below. I am not sure how to interpret the y-axis. I read through the documentation, but it is still not clear to me. Does anyone use this package?
Solved – How to interpret the coefficients in the BSTS package in R
bayesian, bsts, r, state-space-models, time series
Related Solutions
The idea behind the differences that you are finding in your modelling is the following.
Suppose you have a left-hand-side variable $y$ and a regressor $x$. Suppose also that the variables are stationary (for simplicity). Then you have the following equations:
(1a) $y_t=\alpha_0+\alpha_1x_t+u_t$
(1b) $u_t=\psi_1u_{t-1}+...+\psi_pu_{t-p}+v_t+\zeta_1v_{t-1}+...+\zeta_qv_{t-q}$
There are two ways you can proceed.
- Case A: you estimate equation (1a) first, obtain the residuals, then use them to estimate equation (1b).
- Case B: you estimate both equations simultaneously.
In case A, in the first stage you implicitly make an assumption that the residuals in equation (1a) are independent across time. Then in the second stage you revoke this assumption and model the residuals from equation (1a) as an ARMA process given by equation (1b). (So there is an intrinsic contradiction between the assumptions in stage 1 and stage 2 that shows the weakness of this approach.)
In case B, you assume that the residuals from equation (1a) follow an ARMA process as given in equation (1b) and use this information when estimating equation (1a).
Thus there is a difference between the two cases; no wonder you get different results in your modelling where I think you are indeed comparing case A and case B.
I think you incorrectly assume that "the intercept shown to come from the regression using xreg= as the X variables, before any arima model is done on those residuals". This is very close to the truth, but not exactly true; you miss the fact that the function arima() uses the assumption given by equation (1b) when estimating equation (1a).
Thus the difference that you are finding in your modelling is supposed to be there.
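To make the comparison concrete, here is a minimal R sketch on simulated data (the AR(1) error structure and the coefficient values are invented for illustration). Case A fits the regression first and then an ARMA model on its residuals; case B uses arima() with xreg, which fits the regression with ARMA errors, i.e. equations (1a) and (1b) jointly.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
u <- arima.sim(model = list(ar = 0.7), n = n)   # errors following (1b) with p = 1, q = 0
y <- 1 + 2 * x + u                              # equation (1a) with alpha_0 = 1, alpha_1 = 2

## Case A: two stages; stage 1 implicitly assumes iid errors in (1a)
stage1 <- lm(y ~ x)
stage2 <- arima(residuals(stage1), order = c(1, 0, 0), include.mean = FALSE)

## Case B: joint estimation of (1a) and (1b)
joint <- arima(y, order = c(1, 0, 0), xreg = x)

coef(stage1)   # alpha estimates from case A
coef(joint)    # alpha and psi estimates from case B (generally slightly different)
```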
There is also a case C:
(2) $(y_t-\mu_y)=\phi_0+\phi_1(y_{t-1}-\mu_y)+...+\phi_p(y_{t-p}-\mu_y)+\epsilon_t+\theta_1\epsilon_{t-1}+...+\theta_q\epsilon_{t-q}+\beta_1x_t$
where $\mu_y$ is the mean of $y$. Not sure, but perhaps one should also subtract the mean of $x$ to make the last term $\beta_1(x_t-\mu_x)$ instead of just $\beta_1x_t$.
Actually, case C seems more natural to me than case B given how we specify the arguments in arima(). Of course, this is a subjective judgement. However, the interpretation of the coefficient $\beta_1$ in case C is much more difficult than in case B.
A nice related piece by Prof. Hyndman can be found here. I borrowed some ideas from it when answering this post, so I can only recommend reading the original piece!
Disclaimer: I am not an expert on this topic, but I hope I got the idea right. If not, corrections are most welcome!
Q1: What is the connection between PC time series and "maximum variance"?
The data that they are analyzing are $\hat t$ data points for each of the $n$ neurons, so one can think about that as $\hat t$ data points in the $n$-dimensional space $\mathbb R^n$. It is "a cloud of points", so performing PCA amounts to finding directions of maximal variance, as you are well aware. I prefer to call these directions (which are eigenvectors of the covariance matrix) "principal axes", and the projections of the data onto these directions "principal components".
When analyzing time series, the only addition to this picture is that the points are meaningfully ordered, or numbered (from $1$ to $\hat t$), as opposed to being simply an unordered collection of points. This means that if we take the firing rate of a single neuron (which is one coordinate in $\mathbb R^n$), its values can be plotted as a function of time. Similarly, if we take one PC (which is a projection from $\mathbb R^n$ onto some line), it also has $\hat t$ values and can be plotted as a function of time. So if the original features are time series, then the PCs are also time series.
I agree with @Nestor's interpretation above: each original feature can then be seen as a linear combination of PCs, and as PCs are uncorrelated with each other, one can think of them as basis functions that the original features are decomposed into. It's a little bit like Fourier analysis, but instead of taking a fixed basis of sines and cosines, we find the "most appropriate" basis for this particular dataset, in the sense that the first PC accounts for the most variance, etc.
"Accounting for most variance" here means that if you only take one basis function (time series) and try to approximate all your features with it, then the first PC will do the best job. So the basic intuition here is that the first PC is a basis function time series that fits all the available time series the best, etc.
Why is this passage in Freeman et al. so confusing?
Freeman et al. analyze the data matrix $\hat{\mathbf Y}$ with variables (i.e. neurons) in rows (!), not in columns. Note that they subtract row means, which makes sense as variables are usually centred prior to PCA. Then they perform SVD: $$\hat {\mathbf Y} = \mathbf{USV}^\top.$$ Using the terminology I advocate above, the columns of $\mathbf U$ are principal axes (directions in $\mathbb R^n$) and the columns of $\mathbf{VS}$ are principal components (time series of length $\hat t$).
The sentence that you quoted from Freeman et al. is quite confusing indeed:
The principal components (the columns of $\mathbf V$) are vectors of length $\hat t$, and the scores (the columns of $\mathbf U$) are vectors of length $n$ (number of voxels), describing the projection of each voxel on the direction given by the corresponding component, forming projections on the volume, i.e. whole-brain maps.
First, columns of $\mathbf V$ are not PCs, but PCs scaled to unit norm. Second, columns of $\mathbf U$ are NOT scores, because "scores" usually means PCs. Third, "direction given by the corresponding component" is a cryptic notion. I think that they flip the picture here and suggest thinking about $n$ points in $\hat t$-dimensional space, so that now each neuron is a data point (and not a variable). Conceptually it sounds like a huge change, but mathematically it makes almost no difference, the only change being that principal axes and [unit-norm] principal components swap places. In this case, my PCs from above ($\hat t$-long time series) become principal axes, i.e. directions, and $\mathbf U$ can be thought of as normalized projections onto these directions (normalized scores?).
I find this very confusing, so I suggest ignoring their choice of words and looking only at the formulas. From this point on I will keep using the terms as I like them, not as Freeman et al. use them.
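To keep the bookkeeping straight, here is a small R check of the decomposition with variables in rows, as in $\hat{\mathbf Y}$, on random data: the columns of $\mathbf U$ match prcomp()'s principal axes and the columns of $\mathbf{VS}$ match the principal-component time series (both only up to sign flips).

```r
set.seed(2)
n <- 10; t_hat <- 100
Y  <- matrix(rnorm(n * t_hat), nrow = n)      # n neurons (rows) x t_hat time points (columns)
Yc <- Y - rowMeans(Y)                         # subtract row means, as in the paper

s    <- svd(Yc)                               # Yc = U S V'
axes <- s$u                                   # columns: principal axes, directions in R^n
pcs  <- s$v %*% diag(s$d)                     # columns: PC time series of length t_hat

## cross-check against prcomp() on the transposed (time x neuron) matrix
p <- prcomp(t(Y), center = TRUE, scale. = FALSE)
max(abs(abs(axes) - abs(p$rotation)))         # ~0: U holds the principal axes
max(abs(abs(pcs)  - abs(p$x)))                # ~0: VS holds the PC time series
```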
Q2: What are the state space trajectories?
They take single-trial data and project it onto the first two principal axes (i.e. the first two columns of $\mathbf U$). If you did this with the original data $\hat{\mathbf Y}$, you would get the first two principal components back. Again, the projection onto one principal axis is one principal component, i.e. a $\hat t$-long time series.
If you do it with some single-trial data $\mathbf Y$, you again get two $\hat t$-long time series. In the movie, each single line corresponds to such a projection: the x-coordinate evolves according to PC1 and the y-coordinate according to PC2. This is what is called "state space": PC1 plotted against PC2. Time goes by as the dot moves around.
Each line in the movie is obtained with a different single trial $\mathbf Y$.
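As a rough illustration (everything simulated here; in the paper the axes come from the trial-averaged data $\hat{\mathbf Y}$ and each single trial $\mathbf Y$ is projected onto them), such a trajectory can be drawn like this:

```r
set.seed(2)
n <- 10; t_hat <- 100
Y_trial <- matrix(rnorm(n * t_hat), nrow = n)   # one made-up trial: n neurons x t_hat time points
Yc      <- Y_trial - rowMeans(Y_trial)
axes    <- svd(Yc)$u                            # stand-in for the principal axes of the averaged data

traj <- t(axes[, 1:2]) %*% Yc                   # 2 x t_hat: coordinates along PC1 and PC2 over time
plot(traj[1, ], traj[2, ], type = "l",
     xlab = "PC1", ylab = "PC2", main = "State-space trajectory of one trial")
```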
Best Answer
If I understand what you are plotting correctly, the y-axis corresponds to the expected value of the coefficient's posterior, for the regression component of the model.
In the model$coefficients matrix, the columns correspond to the predictors and the rows to the MCMC draws.
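So, assuming model is the fitted bsts object and the plot shows posterior means, the y-axis values could be reproduced roughly as below; the burn-in of 100 draws is an arbitrary choice on my part.

```r
draws <- model$coefficients                      # rows = MCMC draws, columns = predictors
draws <- draws[-seq_len(100), , drop = FALSE]    # drop assumed burn-in draws

colMeans(draws)                                  # posterior mean of each coefficient
colSums(draws) / pmax(colSums(draws != 0), 1)    # posterior mean given inclusion
```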