It will be helpful to distinguish the model from the inference you want to make with it, because standard terminology often mixes the two.
The model is the part where you specify the nature of: the hidden space (discrete or continuous), the hidden state dynamics (linear or non-linear), the nature of the observations (typically conditionally multinomial or Normal), and the measurement model connecting the hidden state to the observations. HMMs and state space models are two such sets of model specifications.
For any such model there are three standard tasks: filtering, smoothing, and prediction. Any time series text (or indeed Google) should give you an idea of what they are. Your question is about filtering, which is a way to get a) a posterior distribution over (or 'best' estimate of, for some sense of best, if you're not feeling Bayesian) the hidden state at $t$ given the complete set of data up to and including time $t$, and relatedly b) the probability of the data under the model.
In situations where the state is continuous, the state dynamics and measurement are linear, and all noise is Normal, a Kalman Filter will do that job efficiently. Its analogue when the state is discrete is the Forward Algorithm. In the case where there is non-Normality and/or non-linearity, we fall back on approximate filters. There are deterministic approximations, e.g. the Extended or Unscented Kalman Filters, and there are stochastic approximations, the best known of which is the Particle Filter.
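For concreteness, here is a minimal sketch of the linear-Gaussian case in Python with NumPy. The function name and argument names (`kalman_filter`, `Psi`, `H`, `Q`, `R`, etc.) are my own choices for illustration, not from any particular library:

```python
import numpy as np

def kalman_filter(y, Psi, H, Q, R, m0, P0):
    """Sketch of the standard Kalman filter.

    y      : (T, d_obs) array of observations
    Psi    : state transition matrix
    H      : measurement matrix
    Q, R   : state / observation noise covariances
    m0, P0 : prior mean and covariance of the initial state
    """
    m, P = m0, P0
    means = []
    for yt in y:
        # Predict: propagate the state estimate through the dynamics.
        m = Psi @ m
        P = Psi @ P @ Psi.T + Q
        # Update: correct the prediction with the new observation.
        S = H @ P @ H.T + R              # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
        m = m + K @ (yt - H @ m)
        P = P - K @ S @ K.T
        means.append(m)
    return np.array(means)
```

Each step costs a few small matrix operations, which is why the exact linear-Gaussian filter is so cheap compared with the approximations below.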
The general feeling seems to be that in the presence of unavoidable non-linearity in the state or measurement parts or non-Normality in the observations (the common problem situations), one tries to get away with the cheapest approximation possible. So, EKF then UKF then PF.
The literature on the Unscented Kalman filter usually has some comparisons of situations when it might work better than the traditional linearization of the Extended Kalman Filter.
The Particle Filter has almost complete generality - any non-linearity, any distributions - but it has in my experience required quite careful tuning and is generally much more unwieldy than the others. In many situations however, it's the only option.
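To give a sense of what the tuning involves, here is a sketch of the simplest variant, a bootstrap particle filter for a one-dimensional state. The interface (`transition`, `loglik`, `init` callables) is my own invention for illustration; real implementations add better resampling schemes and degeneracy diagnostics:

```python
import numpy as np

def bootstrap_pf(y, n_particles, transition, loglik, init, rng=None):
    """Bootstrap particle filter sketch (1-d state).

    transition(x, rng) : samples x_t given x_{t-1}, vectorised over particles
    loglik(yt, x)      : log p(y_t | x_t) for each particle
    init(n, rng)       : samples n particles from the prior
    """
    rng = rng or np.random.default_rng()
    x = init(n_particles, rng)
    means = []
    for yt in y:
        x = transition(x, rng)            # propagate each particle
        logw = loglik(yt, x)              # weight by the observation likelihood
        w = np.exp(logw - logw.max())     # subtract max for numerical stability
        w /= w.sum()
        means.append(np.sum(w * x))       # filtered mean estimate
        idx = rng.choice(n_particles, n_particles, p=w)  # multinomial resample
        x = x[idx]
    return np.array(means)
```

The careful tuning mentioned above hides in the choices this sketch makes naively: the number of particles, the proposal (here just the transition), and when and how to resample.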
As for further reading: I like ch. 4-7 of Särkkä's Bayesian Filtering and Smoothing, though it's quite terse. The author has made an online copy available for personal use. Otherwise, most state space time series books will cover this material. For Particle Filtering, there's a Doucet et al. volume on the topic, but I guess it's quite old now. Perhaps others will point out a newer reference.
A long time ago I wrote this paper with a co-worker (which was summarily rejected when we submitted for publication...). It answers precisely the question you ask.
Are you sure the betas add up to one? I am no expert in finance, but I had the impression that perhaps a weighted average of the betas (using relative market capitalization as weights) might add up to 1, rather than the betas themselves. Even of that I am not sure.
Answer to the EDIT of the original question:
1 & 2: Yes, the approach carries forward to any number of factors. However, while I am able to rationalize a restriction such as the average (or weighted average) of the betas being 1 when the single factor is excess market return over the non-risk asset, I think in the three factor model there is no obvious restriction to enforce.
No, the "response" is multivariate. In order to impose a (soft) restriction on the betas, you have to estimate all of them at once. The dimension of the state vector (= number of sectors dealt with at once) in chunk 2, for instance, is nc=10.
3: In general, you can set 'lower' and 'upper' in the maximization routines to whatever values you think are reasonable bounds for the parameters. You must be aware, though, that the parameters in this model are the variances of the state and observation noises and cannot therefore be negative. The (time-varying) betas are computed in the state.
4: Rationale for 0.05 and 8: we simply found (by trial and error) that these are reasonable initial values for achieving convergence of the MLE. In practice you would try several different sets of starting values.
The transition matrix relates the state at time $t$ to the state at time $t-1$.
If we write the temporal coherence equation like this:
$$ x_t = \Psi x_{t-1} + \epsilon_p $$
This is the temporal model. It tells you the tendency of your system. When no measurement is available, the system will follow this tendency. When one is available, there is a trade-off between where the measurement says the track should go and where the temporal model says it should go.
$\Psi$ is the transition matrix then.
You can have different types of transition matrix, for instance, temporal brownian motion, where $\Psi = I $, meaning that the next state is the last one plus some noise.
Another possibility would be constant velocity.
Imagine an easy example in 1d. We are tracking the position of an object and its velocity. It is just the same equation as above, in this particular case.
$$ \begin{bmatrix} x_t\\ vel_t \end{bmatrix} = \begin{bmatrix} 1 & 1\\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_{t-1}\\ vel_{t-1} \end{bmatrix} + \epsilon_p $$
Then, if you multiply out the terms, you get
$$ x_t = x_{t-1} + vel_{t-1} + \epsilon_{p,x}$$ $$ vel_t = vel_{t-1} + \epsilon_{p,vel}$$
This example would be, as the second equation tells us, a constant velocity model.
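The propagation above is easy to check numerically. This is a small sketch (NumPy, noise-free for clarity, with my own variable names) showing that under $\Psi = \begin{bmatrix} 1 & 1\\ 0 & 1 \end{bmatrix}$ the position advances by the velocity at every step while the velocity stays constant:

```python
import numpy as np

# Constant-velocity transition: position gains the velocity each step,
# velocity is carried over unchanged (noise omitted for illustration).
Psi = np.array([[1.0, 1.0],
                [0.0, 1.0]])

state = np.array([0.0, 2.0])  # position 0, velocity 2
for t in range(5):
    state = Psi @ state       # noise-free propagation

# After 5 steps: position 10.0, velocity still 2.0
```

Swapping `Psi` for the identity matrix recovers the Brownian-motion case mentioned earlier, where the next state is just the last one (plus noise).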
If you still have doubts, there is a nice explanation of Kalman Filter here: http://web4.cs.ucl.ac.uk/staff/s.prince/book/book.pdf Chapter 19.