Is there a simulation method that is not Monte Carlo? All simulation methods involve substituting random numbers into a function to find the range of values it can take. So are all simulation methods in essence Monte Carlo methods?
Monte Carlo Methods – Are All Simulation Methods Some Form of Monte Carlo?
monte carlo
Related Solutions
I think there are two techniques that you may be thinking of: temporal difference (TD) learning and Monte Carlo. The difference between the algorithms is how they set a new value target based on experience.
Using action values to make it a little more concrete, and sticking with on-policy evaluation (not control) to keep arguments simple, then the update to estimate $Q(S_t,A_t)$ takes the same general form for both TD and Monte Carlo:
$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha(X - Q(S_t,A_t))$$
Where $X$ is the value of an estimate for the true value of $Q(S_t,A_t)$ gained through some experience. When discussing whether an RL value-based technique is biased or has high variance, the part that has these traits is whatever stands for $X$ in this update.
For Monte Carlo techniques, the value of $X$ is estimated by following a sample trajectory starting from $(S_t,A_t)$ and adding up the rewards to the end of an episode (at time $\tau$):
$$\sum_{k=0}^{\tau-t-1} \gamma^kR_{t+k+1}$$
For temporal difference, the value of $X$ is estimated by taking one step of sampled reward, and bootstrapping the estimate using the Bellman equation for $q_{\pi}(s,a)$:
$$R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$$
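In code, the shared update rule and the two choices of target might look like the sketch below. This is a minimal tabular version; the function names and the dict representation of $Q$ are my own, not from any particular library.

```python
def update(Q, s, a, X, alpha):
    """Shared update rule: Q(S_t, A_t) <- Q(S_t, A_t) + alpha * (X - Q(S_t, A_t))."""
    Q[(s, a)] += alpha * (X - Q[(s, a)])

def mc_target(rewards, gamma):
    """Monte Carlo target: discounted sum of sampled rewards to the end of the episode."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def td_target(r, s_next, a_next, Q, gamma):
    """TD(0) target: one sampled reward plus the bootstrapped estimate of the next pair."""
    return r + gamma * Q[(s_next, a_next)]
```

The only difference between the two methods is which of these targets is passed in as `X`; the `update` call itself is identical.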
Bias
In both cases, the true value function that we want to estimate is the expected return after taking the action $A_t$ in state $S_t$ and following policy $\pi$. This can be written:
$$q_{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\tau-t-1} \gamma^kR_{t+k+1} \,\middle|\, S_t=s, A_t=a\right]$$
That should look familiar. The Monte Carlo target for $X$ is clearly a direct sample of this value, and has the same expected value as the quantity we are looking for. Hence it is not biased.
TD learning, however, has a problem due to initial value estimates. The bootstrap value $Q(S_{t+1}, A_{t+1})$ is initially whatever you set it to, arbitrarily, at the start of learning. This has no bearing on the true value you are looking for, so the target is biased. Over time, the bias decays exponentially as real values from experience feed into the update process. At least that is true for basic tabular forms of TD learning. When you add a neural network or other function approximation, this bias can cause stability problems, causing an RL agent to fail to learn.
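That exponential decay of the initialisation bias can be seen in a tiny deterministic example. This is my own construction: a one-state MDP with reward 1 per step and $\gamma = 0.5$, so the true value is 2, with the estimate deliberately initialised far away.

```python
gamma, alpha = 0.5, 0.1
Q = 10.0          # arbitrary initial estimate; the true value is 2.0
biases = []
for _ in range(200):
    # TD(0) update with target X = R + gamma * Q (reward is always 1.0)
    Q += alpha * ((1.0 + gamma * Q) - Q)
    biases.append(abs(Q - 2.0))
# each bias is (1 - alpha * (1 - gamma)) = 0.95 times the previous one,
# so the influence of the arbitrary initial value shrinks geometrically
```

With sampled (noisy) rewards the same geometric pull toward the true value still operates, just with sampling noise on top.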
Variance
Looking at how each update mechanism works, you can see that the TD target is exposed to three factors, each of which can vary (in principle, depending on the environment) over a single time step:
- What reward $R_{t+1}$ will be returned.
- What the next state $S_{t+1}$ will be.
- What action the policy will choose for $A_{t+1}$.
Each of these factors may increase the variance of the TD target value. However, the bootstrapping mechanism means there is no direct variance contribution from later events on the trajectory; those three factors are the whole exposure.
In comparison, the Monte Carlo return depends on every reward, state transition and policy decision from $(S_t, A_t)$ up to $S_{\tau}$. As this usually spans multiple time steps, the variance will also be a multiple of that seen in TD learning on the same problem. This is true even for deterministic environments with sparse deterministic rewards, because an exploring policy must be stochastic, which injects some variance at every time step involved in the sampled value estimate.
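The variance gap is easy to demonstrate empirically. The toy setup below is my own: a 5-step chain with deterministic transitions and an independent $N(1, 1)$ reward on each step, $\gamma = 1$, and the TD target bootstrapping from a fixed, already-converged estimate (the true remaining return is 4), so that only target variance is measured.

```python
import random

random.seed(0)
STEPS, N = 5, 20000

def mc_target():
    # Sum of 5 noisy rewards: per-step variance accumulates along the trajectory.
    return sum(random.gauss(1.0, 1.0) for _ in range(STEPS))

def td_target():
    # One noisy reward plus a fixed bootstrap value for the rest of the return.
    return random.gauss(1.0, 1.0) + 4.0

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

mc_var = var([mc_target() for _ in range(N)])
td_var = var([td_target() for _ in range(N)])
# mc_var should come out near 5.0 and td_var near 1.0: both targets have
# the same mean (unbiased here), but the MC target's variance is 5x larger
```

The factor of roughly `STEPS` between the two variances is exactly the "multiple of that seen in TD learning" described above.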
Best Answer
There are simulations that are not Monte Carlo. Basically, all Monte Carlo methods rely on the (weak) law of large numbers: the sample mean converges to its expectation.
Then there are quasi-Monte Carlo methods. These sample from low-discrepancy sequences, a compromise between random numbers and equally spaced grids, to yield faster convergence.
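A small sketch of that compromise, using the base-2 van der Corput sequence (one classic low-discrepancy sequence) against plain pseudo-random sampling. The integrand and sample size here are my own choices for illustration: estimating $\int_0^1 x^2\,dx = 1/3$.

```python
import random

def van_der_corput(n, base=2):
    """n-th term of the base-b van der Corput low-discrepancy sequence."""
    q, denom = 0.0, 1.0
    while n:
        denom *= base
        n, rem = divmod(n, base)
        q += rem / denom
    return q

def estimate(points):
    """Average of x^2 over the sample points, approximating the integral."""
    return sum(x * x for x in points) / len(points)

N = 1000
random.seed(1)
mc = estimate([random.random() for _ in range(N)])           # plain Monte Carlo
qmc = estimate([van_der_corput(i) for i in range(1, N + 1)]) # quasi-Monte Carlo
# at the same N, the quasi-Monte Carlo estimate is typically much
# closer to the true value 1/3 than the plain Monte Carlo one
```

The quasi-random points cover $[0, 1]$ far more evenly than independent draws, which is where the faster convergence comes from.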
Simulations that are not Monte Carlo are used, for example, in computational fluid dynamics. It is easy to model fluid dynamics on a "micro scale" of single portions of the fluid. These portions have an initial speed, pressure and size, and are affected by forces from neighbouring portions or from solid bodies. The simulation computes the behaviour of the whole fluid by calculating all the portions and their interactions. Doing this efficiently is a science in itself. No random numbers are needed there.
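As a minimal stand-in for this kind of deterministic, grid-based simulation (real CFD solvers are far more involved), here is 1-D heat diffusion by explicit finite differences. No random numbers appear anywhere; the parameters are arbitrary illustrative choices.

```python
def diffuse(u, alpha=0.1, steps=100):
    """Advance a 1-D temperature profile; the end cells stay fixed."""
    u = list(u)
    for _ in range(steps):
        new = u[:]
        for i in range(1, len(u) - 1):
            # discrete Laplacian: each cell moves toward its neighbours' average
            new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
        u = new
    return u

# a hot spot in the middle spreads out and flattens, fully deterministically:
profile = diffuse([0, 0, 0, 10, 0, 0, 0])
```

Running the same simulation twice gives bit-identical results, which is precisely what distinguishes it from a Monte Carlo method.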
In meteorology or climate research, things are done similarly, but the initial values are not exactly known: you only have meteorological data at the points where it has been measured. A lot of the data has to be guessed.
As these complicated problems are often not continuous in their input data, you run the simulation with different guesses and choose the final result from among the most frequent outcomes. This is, in principle, how some ensemble weather forecasts are produced.
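A hedged toy version of that ensemble idea: run the same deterministic model from many perturbed initial guesses and tally the outcomes. The chaotic logistic map stands in for a weather model here; the threshold, perturbation size and labels are all arbitrary choices of mine.

```python
import random

def model(x, steps=50):
    """Deterministic but chaotic toy 'weather model' (logistic map, r = 3.9)."""
    for _ in range(steps):
        x = 3.9 * x * (1 - x)
    return x

random.seed(2)
measured = 0.3            # the one initial value we actually measured
outcomes = []
for _ in range(200):
    guess = measured + random.uniform(-0.01, 0.01)   # perturbed initial guess
    outcomes.append("wet" if model(guess) > 0.5 else "dry")

forecast = max(set(outcomes), key=outcomes.count)    # most frequent outcome
```

Note where the randomness lives: each individual model run is fully deterministic, and random numbers enter only through the uncertain initial conditions.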