The class of score-driven models might be of interest to you:
Creal, D. D., S. J. Koopman, and A. Lucas (2013). Generalized autoregressive score models with applications. Journal of Applied Econometrics 28(5), 777--795
Score-driven models were applied to a binary time series of outcomes of The Boat Race between Oxford and Cambridge,
https://timeserieslab.com/articles.html#boatrace
In that paper, the time-varying probability was obtained with the score-driven methodology by using the (free) Time Series Lab software package.
The score-driven model for binary observations, in short:
Our observations can take on one of two values, 0 or 1.
We therefore assume that these observations come from the Bernoulli distribution with probability mass function (pmf)
\begin{equation}\label{eq:pdf}
p(y_t | \pi_t) = \pi_t^{y_t} (1-\pi_t)^{1-y_t}
\end{equation}
where $\pi_t$ is a time-varying probability and $y_t \in \{0,1\}$ for $t = 1,\ldots,T$ where $T$ is the
length of the time series.
We can specify the time-varying probability $\pi_t$ as a function of a dynamic process $\alpha_t$, that is
\begin{equation}\label{eq:pi}
\begin{aligned}
\pi_{t+1} &= f(\alpha_t),\\
\alpha_{t+1} &= \omega + \phi \alpha_t + \kappa s_t ,
\end{aligned}
\end{equation}
where $f(\cdot)$ is the logistic function (the inverse of the logit link), which ensures that $\pi_t$ takes values between 0 and 1.
You can easily include more lags of $\alpha_t$ or $s_t$ in the equation above.
The unknown coefficients are the constant $\omega$, the autoregressive parameter $\phi$, and the score updating parameter $\kappa$; they are estimated by maximum likelihood.
The driving force behind the updating equation of $\alpha_t$ is the scaled score innovation $s_t$
as given by
\begin{equation}\label{eq:score}
\qquad s_t = S_t \cdot \nabla_t, \qquad \nabla_t = \frac{\partial \, \log \, p(y_t | \pi_t)}{\partial \alpha_t},
\end{equation}
for $t = 1,\ldots,T$, where $\nabla_{t}$ is the score of the density $p(y_t | \pi_t)$ and $S_t$ is a scaling factor, often chosen as the inverse of the Fisher information.
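The filtering recursion above is short enough to sketch directly. The following is a minimal illustration (not Time Series Lab's implementation), assuming the logistic link and unit scaling $S_t = 1$; with the logistic link, the score of the Bernoulli log-density with respect to $\alpha_t$ simplifies to $y_t - \pi_t$, while inverse-Fisher scaling would instead give $(y_t - \pi_t)/(\pi_t(1-\pi_t))$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def score_driven_filter(y, omega, phi, kappa, alpha0=0.0):
    """Filter time-varying Bernoulli probabilities with a score-driven update.

    Assumes pi_t = sigmoid(alpha_t) and unit scaling S_t = 1, in which case
    the scaled score innovation is s_t = y_t - pi_t.
    """
    T = len(y)
    alpha = np.empty(T + 1)
    pi = np.empty(T)
    alpha[0] = alpha0
    for t in range(T):
        pi[t] = sigmoid(alpha[t])
        s = y[t] - pi[t]                      # score w.r.t. alpha_t (logistic link)
        alpha[t + 1] = omega + phi * alpha[t] + kappa * s
    return pi, alpha

# Toy series with a regime shift from mostly 0s to mostly 1s; the filtered
# probability drifts upward once the 1s start arriving.
y = np.array([0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1])
pi, alpha = score_driven_filter(y, omega=0.0, phi=0.95, kappa=0.5)
```

In practice $\omega$, $\phi$, and $\kappa$ would be chosen by maximizing the Bernoulli log-likelihood $\sum_t \log p(y_t \mid \pi_t)$ over this recursion, e.g. with a numerical optimizer; the fixed values here are purely illustrative.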
[Disclaimer] I am one of the developers of Time Series Lab.
When trying to get an intuitive understanding of formal mathematical models, it is usually best to start with a simple model and then generalise later. So, with that in mind, let's start with an AR$(1)$ model with zero mean based on a white-noise series $\varepsilon_t \sim \text{IID N}(0,1)$. This model can be written in scalar form as:
$$Y_t = \phi Y_{t-1} + \sigma \varepsilon_t.$$
Now, you can substitute in this auto-regression to get $Y_t$ in terms of earlier and earlier terms:
$$\begin{equation} \begin{aligned}
Y_t
&= \phi Y_{t-1} + \sigma \varepsilon_t \\[6pt]
&= \phi (\phi Y_{t-2} + \sigma \varepsilon_{t-1}) + \sigma \varepsilon_t \\[6pt]
&= \phi^2 Y_{t-2} + \sigma (\varepsilon_t + \phi \varepsilon_{t-1}) \\[6pt]
&= \phi^2 (\phi Y_{t-3} + \sigma \varepsilon_{t-2}) + \sigma (\varepsilon_t + \phi \varepsilon_{t-1}) \\[6pt]
&= \phi^3 Y_{t-3} + \sigma (\varepsilon_t + \phi \varepsilon_{t-1} + \phi^2 \varepsilon_{t-2}) \\[6pt]
&= \phi^3 (\phi Y_{t-4} + \sigma \varepsilon_{t-3}) + \sigma (\varepsilon_t + \phi \varepsilon_{t-1} + \phi^2 \varepsilon_{t-2}) \\[6pt]
&= \phi^4 Y_{t-4} + \sigma (\varepsilon_t + \phi \varepsilon_{t-1} + \phi^2 \varepsilon_{t-2} + \phi^3 \varepsilon_{t-3}) \\[6pt]
&= \cdots \\[6pt]
&= \phi^k Y_{t-k} + \sigma \sum_{i=0}^{k-1} \phi^i \varepsilon_{t-i}.
\end{aligned} \end{equation}$$
If $|\phi| < 1$, the first term vanishes as $k \rightarrow \infty$, leaving the MA$(\infty)$ representation:
$$Y_t = \sigma \sum_{i=0}^\infty \phi^i \varepsilon_{t-i}.$$
This shows you that if $|\phi| < 1$ then you can write an AR$(1)$ process as an MA$(\infty)$ process. The infinite sum in this expression is called a generating function, and in this representation it allows you to find the distribution of the observable series of values.
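The back-substitution above can be confirmed numerically. As a quick sketch, if we start the recursion at $Y_0 = 0$, then after $T$ steps the recursion should agree exactly with the truncated sum $Y_T = \sigma \sum_{i=0}^{T-1} \phi^i \varepsilon_{T-i}$ (the vanished first term is exactly zero here):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, sigma, T = 0.8, 1.5, 200

eps = rng.standard_normal(T + 1)   # eps[1], ..., eps[T] drive the recursion
Y = np.zeros(T + 1)                # Y[0] = 0 kills the phi^k Y_{t-k} term
for t in range(1, T + 1):
    Y[t] = phi * Y[t - 1] + sigma * eps[t]          # AR(1) recursion

# Truncated MA(infinity) sum evaluated at time T
ma_sum = sigma * sum(phi**i * eps[T - i] for i in range(T))

print(abs(Y[T] - ma_sum))   # agreement up to floating-point rounding
```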
Using the characteristic polynomial: Rather than doing all this in scalar form, the model can be written using the lag-operator $L$ as:
$$\phi(L) Y_t = \sigma \varepsilon_t,$$
where $\phi(L) = 1 - \phi L$ is the auto-regressive characteristic polynomial (which in this case is an affine function). Now, it turns out that this polynomial function can be inverted in the same way as a polynomial function involving a real or complex number (as opposed to the lag operator). That is, if $|\phi| < 1$ then the polynomial follows the inversion rule for an infinite geometric sum:
$$\phi^{-1}(L) = \frac{1}{1-\phi L} = \sum_{i=0}^\infty \phi^i L^i.$$
Applying this to the process you get the MA$(\infty)$ representation we derived in scalar form before:
$$Y_t = \sigma \phi^{-1}(L) \varepsilon_t = \sigma \sum_{i=0}^\infty \phi^i L^i \varepsilon_t = \sigma \sum_{i=0}^\infty \phi^i \varepsilon_{t-i}.$$
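The geometric-sum inversion can also be checked with ordinary polynomial arithmetic: multiplying $1 - \phi x$ by the truncated expansion $\sum_{i=0}^{k} \phi^i x^i$ telescopes to $1 - \phi^{k+1} x^{k+1}$, and the leftover term dies out as $k$ grows when $|\phi| < 1$. A small sketch using coefficient convolution:

```python
import numpy as np

phi, k = 0.6, 30
ar_poly = np.array([1.0, -phi])            # coefficients of 1 - phi*x
geo = phi ** np.arange(k + 1)              # truncated expansion: 1, phi, ..., phi^k
prod = np.convolve(ar_poly, geo)           # polynomial multiplication

# All interior coefficients cancel, leaving 1 - phi^(k+1) x^(k+1);
# the trailing coefficient is tiny because |phi| < 1.
print(prod[0], prod[-1])
```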
You can see from the above that it is possible to deal with time-series models via scalar methods, without using the lag operator at all. By introducing the lag operator, and polynomial functions of this operator, certain calculations (like the above inversion) become much simpler. In order to verify that these are allowable, mathematicians have to appeal to the theory of functions and operators to establish that polynomials involving the lag function behave like polynomials involving real/complex numbers. Once they have established that, this allows them to use polynomials involving the lag operator to simplify changes of form in time-series models.
Hi: Your model is also called a Koyck distributed lag, and it can be difficult to estimate with small samples. With larger samples, my experience is that bias is not a problem (I used simulation to check this).
The link below briefly discusses the statistical properties of the estimates on pages 12 and 13. Essentially, its problems are similar to those of the estimates of an AR(1).
https://www.reed.edu/economics/parker/312/tschapters/S13_Ch_3.pdf
I would check out Hamilton or Koyck's short 1954 book for a more in-depth discussion, but hopefully the above helps some.
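The small-sample bias point can be illustrated with a quick Monte Carlo sketch. This is not the simulation mentioned above; it uses OLS on a plain AR(1) (whose estimate shares the Koyck model's small-sample behaviour) and simply compares the average estimate at two sample sizes:

```python
import numpy as np

def simulate_ar1(phi, T, rng):
    """Simulate an AR(1) series of length T started at zero."""
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = phi * y[t - 1] + rng.standard_normal()
    return y

def ols_phi(y):
    # OLS slope of y_t regressed on y_{t-1}, no intercept
    return (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])

rng = np.random.default_rng(42)
phi = 0.5
mean_est = {T: np.mean([ols_phi(simulate_ar1(phi, T, rng)) for _ in range(2000)])
            for T in (20, 500)}

# The T=20 average sits noticeably below the true phi (downward small-sample
# bias), while the T=500 average is close to it.
print(mean_est)
```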