In the Fokker-Planck equation, the unknown function (called here $p$) is a spatial probability density function at a given time $t$. We can write the Fokker-Planck equation as follows :
$$\left\{\begin{array}{ll}\frac{\partial p}{\partial t}+\frac{\partial}{\partial x}\big(\mu(x,t)p\big)
-\frac{\partial^2}{\partial x^2}\left(D(x,t)p\right)=0\\
p(x,t_0)=f(x).\end{array}\right.$$
This govern the evolution of the probability distribution function from the initial condition $f(x)$. An important property of the Fokker-Planck equation is the called mass conservation. The quantity $\int_{\mathbb R}p(x,t)\mathrm dx$ is independent of $t$ and is equal to $\int_{\mathbb R}f(x)\mathrm dx$. If $f(x)=\delta(x-x_0)$, we have $\int_{\mathbb R}p(x,t)\mathrm dx=1$ for all time $t$. As the equation is linear, we deduce that if we call $p(x,t|x_0,t_0)$ the solution of the Fokker-Planck equation
with $f(x)=\delta(x-x_0)$ then the solution $p(x,t|f,t_0)$ to the Fokker-Planck equation starting at $t_0$ with the probability distribution $f$ is
$$p(x,t|f,t_0)=\int_{\mathbb R}p(x,t|x_0,t_0)f(x_0)\mathrm dx_0.$$
Note that this is another form of the "probability rule" (also called the Chapman-Kolmogorov relation) because $f(x)=p(x,t_0|f,t_0)$.
We have just shown that $p(x,t|x_0,t_0)$ is the Green's function of the
operator
$$\frac{\partial}{\partial t}+\frac\partial{\partial x}\Big(\mu(x,t)\cdot\Big)-\frac{\partial^2}{\partial x^2}\Big(D(x,t)\cdot\Big).$$
In the case of time dependent coefficient $\mu$ and $D$, the question becomes very much dependent on the actual expressions of $\mu$ and $D$.
For instance, if $D(x,t)=\mathscr Dt$ and $\mu=0$, the solution is exactly obtained from the usual Green's function as
$$p(x,t|0,0)=\frac{1}{\sqrt{2\pi\mathscr Dt^2}}\exp\left(-\frac{x^2}{2\mathscr D t^2}\right).$$
But this is an exceptional situation. There are usually no exact solutions.
If the time variation of $\mu$ and $D$ are bounded, a possible approach is to use a multiple-scale expansion consisting of introducing several slow time scales and a perturbation parameter. This is more robust than the standard perturbative expansion especially for time-dependent problems. Many other techniques exist, such as the method of matched asymptotic expansions. Solving time dependent partial differential equations is a difficult problem in general.
Here is a very dirty and quick derivation. All feedback appreciated. From Ito we can derive the KBE
$$\frac{\partial F}{\partial t}+\mu(x,t)\frac{\partial F}{\partial x}(x,t)+\frac{1}{2}\sigma^2(x,t)\frac{\partial^2 F}{\partial x^2}(x,t)=0$$
for $t \in [0,T]$. From this we get
$$\int_\mathbb{R} \frac{\partial F(y,T)}{\partial T}p(y,T;x,t)dy=\\
=(-1)\int_\mathbb{R} \bigg(\mu(y,T)\frac{\partial F}{\partial y}(y,T)+\frac{1}{2}\sigma^2(y,T)\frac{\partial^2 F}{\partial y^2}(y,T)\bigg)p(y,T;x,t)dy$$
By assuming sufficiently strong decay conditions at $+\infty$ and $-\infty$, integration by parts yields just
$$\int_\mathbb{R} \frac{\partial F}{\partial y}(y,T)\mu(y,T)p(y,T;x,t)dy=\int_\mathbb{R}F(y,T)(-1)\frac{\partial}{\partial y}(\mu(y,T)p(y,T;x,t))dy$$
$$\int_\mathbb{R} \frac{\partial^2 F}{\partial y^2}(y,T)\sigma^2(y,T)p(y,T;x,t)dy=\int_\mathbb{R}F(y,T)\frac{\partial^2}{\partial y^2}(\sigma^2(y,T)p(y,T;x,t))dy$$
So we end up with
$$\frac{\partial }{\partial T}E[F(X_T,T)|X_t=x]=\int_{\mathbb{R}}F(y,T)\bigg(\frac{\partial p}{\partial T}+\frac{\partial}{\partial y}(\mu p)-\frac{1}{2}\frac{\partial^2}{\partial y^2}(\sigma^2p)\bigg)dy$$
If $(F(X_t,t))_{t \in [0,T]}$ is a martingale, we want that derivative wrt $T$ to be $0$. So we set the PDE in the parenthesis to $0$:
$$\frac{\partial p}{\partial T}=-\frac{\partial}{\partial y}(\mu p)+\frac{1}{2}\frac{\partial^2}{\partial y^2}(\sigma^2p)$$
Indeed this is the Fokker-Planck equation for $T >t$.
To answer to your question about point (3): if we assume that we are alowed to choose $F(x,t)=\delta(x-y)$ then
$$F(X_t,t)=\delta(X_t-y)\implies E[F(X_t,t)|X_0=x_0]=p(y,t;x_0,0)$$
$$dF(X_t,t)=\bigg(\delta_x(X_t-y)\mu(X_t,t)+\frac{1}{2}\delta_{xx}(X_t-y)\sigma^2(X_t,t)\bigg)dt
+\delta_x(X_t-y)\sigma(X_t,t)dW_t$$
The problem here is that $dF$ would be shorthand for the integral
$$\delta(X_t-y)-\delta(x_0-y)=\int_{[0,t]}\bigg(\delta_x(X_s-y)\mu(X_s,s)+\frac{1}{2}\delta_{xx}(X_s-y)\sigma^2(X_s,s)\bigg)ds+\\
+\int_{[0,t]}\delta_x(X_s-y)\sigma(X_s,s)dW_s$$
and I tried, but struggled to make sense of this and obtain something like the FPE without bending the rules too much. So, as an initial assessment, I think that using the Dirac delta is too far fetched to be an acceptable 'dirty' argument.
Best Answer
I will start by apologising for a physicist's level of rigor in the following derivations but I think they give good insight into the connection between what I would call stochastic physics' Trinity of Langevin, Fokker-Planck and Path integral.
Fokker-Planck from Langevin equation
Given a Langevin equation $dZ_t=b(Z_t)dt+\sigma(Z_t)dW_t$ (the most common form found in physics) we can derive the Fokker-Planck equation from the Chapman-Kolmogorov equation in a quick and dirty way:
$$ p(y,t|x)=\int dz\, p(y,t|z)p(z,t'|x) $$
We suppose that $t = \delta t$ and thus we can write $$p(y,t|z) = \left\langle \delta (y-z-h)\right\rangle_h$$ where $h=\delta Z$ is defined by the associated Langevin equation. Now we have
$$ p(y,t|x)=\int dz\, \left\langle \delta (y-z-h)\right\rangle_hp(z,t'|x)\,.\tag{$\ast$} $$
We can then Taylor expand the delta function as
$$ \left(1+\left\langle h\right\rangle \frac{\partial}{\partial y} +\frac{1}{2}\left\langle h^{2}\right\rangle \frac{\partial^2}{\partial y^2} +\ldots\right)\delta(y-z) = (1+\mathscr{L})\delta(y-z)\,, $$
then integrating by parts and the delta function we are left with
$$ p(y,t|x)=(1+\mathscr{L^\dagger})p(y,t'|x)\,. $$
Expanding $p(y,t|x)= p(y,t'|x) + \frac{\partial}{\partial t}p(y,t'|x) + \ldots$ in the limit $t \rightarrow 0$ we arrive at the Fokker-Planck equation
$$ \frac{\partial p(y,t|x)}{\partial t}=\mathscr{L^\dagger}p(y,t|x) $$
relabelling $t'$ as $t$.
Path integral from Langevin
We apply the Chapman-Kolmogorov equation many times:
$$ p(y,t|x)=\int\prod_{i=1}^{N}dz_{i}\, p(y,t_{N}|z_{N})\ldots p(z_{2},t_{2}|z_1)p(z_{1},t_{1}|x) $$
And then use $(\ast)$ to replace each along with the identity $\delta(x) = \int dk \exp(ikx)$ we get a sequence of averages over complex exponentials:
$$ p(y,t|x)=\int Dz Dk\, \left\langle e^{ik_N(z_N-z_{N-1}-h_N)}\right\rangle\ldots \left\langle e^{ik_2(z_2-z_{1}-h_2)}\right\rangle \left\langle e^{ik_1(z_1-x-h_1)}\right\rangle $$
however we can still do better. Knowing the probability distribution of $h$ (usually Gaussian in physics) we can write $$ \left\langle O \right\rangle = \int dh P(h) O $$ thus after replacing every averaged over exponential we get
$$ p(y,t|x)=\int Dz Dk Dh\, e^{i\int dt \,k\dot{z}} P[h_t] $$
and when you perform the integrations over the paths of $k$ and $h$ you will get your path integral.
Path integral from Fokker-Planck
I am less sure about how to go about directly showing the equivalence without the presence of a Langevin equation. However if we create a Lagrangian by multiplying the F-P equation by an auxiliary field and then integrate over all paths of our probability density and our auxiliary field, I have a feeling this will work but I am not certain so I will leave my answer here.
The main point of this is to show that the path integral and the FP equation are essentially two different representations of the Chapman-Kolmogorov equation. As far as I can tell they are basically interchangeable and I hope someone more knowledgeable than myself can come in and explain why you would use one over the other in certain situations.
I hope this has been useful!