This experiment can be completely explained within classical physics. It must be, because laser pointers produce coherent states which exactly match the predictions of classical electrodynamics. However, it is a very good analogy for the paradoxes you would face in a quantum eraser experiment with an electron beam.
The analogy is good because the light in the classical treatment is described by a wave equation that is very similar to the Schrödinger equation for a single massive particle. Thus, wavefunctions will diffract to create single blobs if given a single slit, and interfere to make fringes if given two slits. You can further encode two different waves in a single particle using a spin-½ degree of freedom, in a manner exactly analogous to the polarization degree of freedom of an EM wave.
We don't find this situation paradoxical in classical physics because the light shining on the screen can't be seen as a number of discrete 'packets', and its intensity varies over a continuum. It does not make sense to ask "where did the light that makes this bright fringe come from?" because it comes from both slits. If you place a detector on each of the slits, you do observe half the power going through each slit. Within classical physics, this occurs no matter how low the laser power is.
Suppose now, though, that you replace your laser with an electron gun. Since the wave mechanics remains (much) the same, the interference fringes, or lack thereof, in the wavefunction and therefore in the detection probability will not be altered. However, electrons do behave as particles fairly often. At low enough electron fluxes, you only ever measure single-electron hits on your detector, and you can ensure only a single electron is ever present in the apparatus. If you put detectors right after the slits, you don't observe half-electrons. It is here that it starts getting paradoxical: if the electron is only ever in one of the two slits when I observe them, how come the interference pattern changes if I have access to the 'which way' information? Note, though, that it's this extra layer of particleness that makes the quantum eraser weird.
Finally, what about light? Can one do a 'quantum' version of this experiment using light? After all, light also comes in photons, and you can power down your laser low enough that only single flashes will show on the screen, right? Well, for one you need to iron out a few wrinkles. For example, you need to make sure that those single flashes are indeed a property of the light and not of the detector; there exist fairly reasonable models which explain the photoelectric effect by quantizing only the atoms and not the field. This means, in particular, that you need to swap your laser for a single-photon source, which is a different beast altogether.
Even then, though, the experiment is not quite enough to be a paradox. The reason for this is that photons don't really have positions or trajectories or even, really, wavefunctions. They're single excitations of the corresponding classical modes, and the modes themselves exhibit interference and wave behaviour. (Indeed, the experiments where you get photons to behave like particle waves are quite different.) Thus, while you can put together a quantum eraser measurement with single photons, the situation is more complicated and calls for a more delicate analysis.
a) What would happen if we did NOT detect an interference at D0 8ns prior to the entangled idler photons reaching D3 or D4 and decide to remove the BSa and BSb beam splitters really really fast such that the idler photons would travel to either D1 or D2 instead, hence the "which path" is not known and therefore we should actually see an interference pattern at D0.
Looking at the data at D0 alone, you never see an interference pattern. Photons come through the initial double slit at a particular rate. Each time one comes through, it undergoes spontaneous parametric down-conversion, so we have a pair of photons. When one half of the pair goes to D0, we detect it. Wherever it lands, it lands. You have to collect many results before you get any pattern at all, so you can't say "no interference" from a handful of hits; an interference pattern is really a frequency histogram. If you have two histograms with the troughs of one aligned with the peaks of the other, the combined histogram doesn't have peaks and troughs. So you can't wait until you see no pattern and then remove the beam splitters. The pattern comes from labeling each hit at D0 with a time and then later sorting the hits into groups, with the peaks of one group on top of the troughs of another. So the "interference pattern" comes later. Even without the additional beam splitters it comes later, because R1 (coincidence with D1) and R2 (coincidence with D2) sort the original D0 collection into two distinct groups. Imagine you see a pattern that doesn't look like an interference pattern, and then 8ns after each hit you get information that lets you label that dot with a happy face or a sad face. The happy-face distribution develops peaks and troughs, the sad-face distribution develops peaks and troughs, and the troughs of one are the peaks of the other and vice versa.
Removing the additional beam splitters just means you have two groups to sort the results into instead of four. You don't see an interference pattern in the aggregate results at D0; you see it only in the two histograms, after you sort the results. And you don't know which group any particular result will be sorted into until 8ns later, when you detect at D1 or D2.
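The sorting argument can be illustrated with a toy simulation. This is only a sketch under assumed numbers, not a model of the real apparatus: hit positions at D0 are drawn uniformly on [0, 1), and each hit is later labelled "D1" with probability cos²(π · n_fringes · x) and "D2" otherwise. The names `simulate`, `contrast` and `n_fringes` are hypothetical.

```python
import math
import random

def simulate(n_hits=40000, n_bins=16, n_fringes=4, seed=1):
    """Toy model: record D0 hits, then sort them by the later D1/D2 label."""
    rng = random.Random(seed)
    total = [0] * n_bins  # every hit at D0, unsorted
    r1 = [0] * n_bins     # hits whose partner later clicks D1
    r2 = [0] * n_bins     # hits whose partner later clicks D2
    for _ in range(n_hits):
        x = rng.random()                 # assumed flat envelope on [0, 1)
        b = int(x * n_bins)
        total[b] += 1
        # Assumed labelling rule: complementary cos^2 / sin^2 fringes.
        p_d1 = math.cos(math.pi * n_fringes * x) ** 2
        if rng.random() < p_d1:
            r1[b] += 1
        else:
            r2[b] += 1
    return total, r1, r2

def contrast(hist):
    """Fringe visibility (max - min) / (max + min) of a histogram."""
    return (max(hist) - min(hist)) / (max(hist) + min(hist))

total, r1, r2 = simulate()
print("aggregate contrast:", round(contrast(total), 3))
print("R1 contrast:       ", round(contrast(r1), 3))
print("R2 contrast:       ", round(contrast(r2), 3))
```

In this sketch the unsorted histogram stays nearly flat, while each sorted subgroup shows clear fringes, and the two subgroups add up bin by bin to the unsorted one.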
b) If we moved either the mirror Mb or Ma just a tiny little bit from its position, such that the red or blue path for the photon would be a bit different in length, wouldn't we then be able to tell, via the time it took to reach either D2 or D1, which of the two slits it came from?
First, there is some leeway: a beam that hasn't been on forever isn't perfectly monochromatic, so there is some room to move the mirrors a bit. Second, if you move the mirrors, D1 and D2 will fire at different rates, so you now sort the results at D0 into two unequal groups. The peaks and troughs of the two subgroups no longer line up perfectly, and the larger one looks less and less like an interference pattern, until at some magic distance only one of D1 or D2 goes off (let's say only D1) and you are now sorting the results of D0 into just one group.
c) If we replaced D0 with another double slit, with the red and blue paths each pointed at one of the two slits, would the signal photon, in the case of the idler photon reaching D3/D4, choose exactly one of the slits and hence not interfere with itself?
Short answer: you are correct. However, the wave function needs to move in multiparticle space; a wave function is not a field in three-dimensional space. And this happens even without adding a second double slit. That's why the histograms of R3 and R4 don't have peaks and troughs (well, they have one peak each and no trough, unless you count it as a trough at infinity that they are concentrated in a finite region). So a second double slit is irrelevant to R3 and R4.
In case you meant to use a second double slit for something else I'll go into more detail about what, if anything, it would do.
A double slit is not magic; it only works a very particular way in very particular situations. For instance, the original double slit has a laser wavefront coming into it, so the waves coming out of each slit are in phase with each other. Furthermore, the wave is for just one particle and there is no entanglement. Those facts serve to determine exactly where the peaks will be if you place a screen in front of it. The light heading over to D0 is very different from monochromatic, in-phase, plane-wave laser light. It is entangled light: the parametric down-conversion produces entangled light, so the two red beams coming out of the SPDC region are entangled with each other. It's like there is a superposition of two particles (one traveling along each of those red beams). So each pair of beams coming out of the SPDC region is a superposition of states of different polarization. But worse than that, they are entangled, so by themselves they don't individually have the properties associated with the entanglement. The red and blue beams could be deflected so that their propagation vectors are orthogonal to a screen with holes and directed towards the holes in the screen. If the holes are large compared to the beam widths, it's like not having a screen with holes at all. If the holes are small, then D0 will fire less often, as the screen absorbs some photons. So you can reproduce those aspects of a standard double-slit setup.
But the two beams are not arriving in phase and each is really an entangled superposition of different polarizations. So you can't expect a double slit there to work exactly the same as it would in a normal double slit set up.
Now normally in quantum mechanics you can track the lines of probability current and even make dynamical equations for them. If you do that, you see that absorbing the edges of a beam makes the surviving (new) edge flare out more.
So the new double slit will flare the beams more. The troughs weren't identically zero, since the original slits were finite-sized as well as finitely spaced (each beam had some thickness). But more than this size and wavelength away from the central region of D0, the new double slit is now more spread out, so you should detect more peaks and troughs. They happen because of the difference in path length between red and blue. There are multiple subgroups with different locations for peaks and troughs, because the red and blue beams don't have a constant phase difference, owing to the entanglement with other beams.
Therefore, in case of the signal photon hitting an area of the screen it could not possibly hit when interfering with itself (gaps in an interference pattern), we would know for certain that the which path is known 8ns beforehand with just a single photon pair (signal/idler), in this special case?
The frequency histogram at D0 is the sum of the histograms for R1, R2, R3, and R4. R3 and R4 have one central peak each, offset from each other since the red and blue beams aim for different places. R1 and R2 each have peaks in the other's troughs.
When you see a hit in the trough of R1 you now know it is much more likely that D1 does not go off 8ns later.
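A rough numerical sketch of that last statement, under loudly stated assumptions: the slit amplitudes at D0 are modelled as two offset Gaussians with a linear relative phase, BSa and BSb are taken to be 50/50, and the joint rates are taken as R1 ∝ |a+b|²/8, R2 ∝ |a−b|²/8, R3 ∝ |a|²/4 and R4 ∝ |b|²/4. The functions `amplitudes`, `joint_rates` and `posterior`, and all parameter values, are hypothetical choices made for illustration only.

```python
import cmath
import math

def amplitudes(x, offset=0.3, width=1.0, k=6.0):
    """Hypothetical signal amplitudes at position x for the two paths."""
    env_a = math.exp(-((x - offset) ** 2) / (4 * width ** 2))
    env_b = math.exp(-((x + offset) ** 2) / (4 * width ** 2))
    a = env_a * cmath.exp(1j * k * x / 2)
    b = env_b * cmath.exp(-1j * k * x / 2)
    return a, b

def joint_rates(x):
    """Unnormalised rates for a hit at x together with a click at each Di."""
    a, b = amplitudes(x)
    return {
        "R1": abs(a + b) ** 2 / 8,  # idler reaches D1 (assumed 50/50 optics)
        "R2": abs(a - b) ** 2 / 8,  # idler reaches D2
        "R3": abs(a) ** 2 / 4,      # idler reaches D3, one path only
        "R4": abs(b) ** 2 / 4,      # idler reaches D4, the other path only
    }

def posterior(x):
    """Probability of each later idler click, given a D0 hit at x."""
    rates = joint_rates(x)
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}

print("at an R1 peak   (x = 0):   ", posterior(0.0))
print("at an R1 trough (x = pi/6):", posterior(math.pi / 6))
```

In this toy model a hit in a trough of R1 makes a later D1 click very unlikely but never identifies the path with certainty, since D2, D3 and D4 all remain possible.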
Best Answer
This is going to be a long answer because there is a lot to unpack here.
I'm going to briefly go over your questions about your understanding, then analyze the experiment from the video in detail, then explain why I think this experiment is not very interesting (since the video also tries to hype it).
Your understanding
It doesn't matter when the "eraser" is used, or when the experimenter looks at anything; you get the same results regardless. Also, there's never an interference pattern visible on the screen. That pattern only shows up in later data analysis.
That's correct in a certain sense. The way quantum mechanics works is that the system (as encoded in the wave function) interferes with itself as a whole. If there's which-path information anywhere in the system, then the paths are orthogonal, so they don't interfere.
But note that applying the "eraser" doesn't destroy the which-path information for this purpose, nor does any measurement, or anything else you can do without involving the photon. There is no way to destroy the which-path information unless you re-combine it with the photon, which doesn't happen in this experiment. "Quantum eraser" is really a misnomer.
That's correct of the "detectors" in this experiment. Typical thought-experiment "detectors" do collapse the wave function, though.
Observation by the experimenter is thermodynamically irreversible (and collapses the wave function), while the "detection" (by these detectors) is reversible (and doesn't). That distinction isn't important in this experiment, though. It's important that the "detection" at the slits be reversible, but the experimenter and the screen could just as well be reversible quantum computers. But it's probably easier, and just as correct, to think of them as wavefunction-collapsing classical objects.
The video (hype)
The video has some silly hype at the beginning:
I disagree with this; I think that it's just a less interesting version of the EPR/Bell/Aspect experiment. I'll explain what I mean by this after the analysis.
Those people are wrong. This isn't a matter of interpretation; they just don't understand how probability works. I'll come back to this after the analysis.
The video (post-hype)
I think (contra David Reishi) that the video is pretty accurate when it sticks to the physics, once you figure out what it means by "detector" and "quantum eraser".
In the video, the "detector" at each slit contains a two-state quantum system (a qubit) which is initially in some known state (say $|0\rangle$) and is flipped (to $|1\rangle$) if the photon passes through the detector. It might be quite hard to engineer such a device, but the laws of physics allow it: it's just a CNOT gate with the presence/absence of the photon as the control bit. (Arguably this shouldn't be called a detector or a measurement because it is reversible, but those are just words; it's clear what the device does physically.)
I'm going to simplify the setup a (qu)bit by omitting one of the detectors, because that doesn't lose any information: if the photon didn't go through that slit then it went through the other (in this idealized experiment free of engineering realities). The single detector is on the right slit, and gives us one qubit, which is $|0\rangle$ if the photon went through the left slit and $|1\rangle$ if it went through the right slit.
The "quantum eraser", which looks like a reject from a new-age crystal-healing video, is a quantum computer which simply applies a Hadamard gate to the qubit (that is, it takes $|0\rangle$ to $(|0\rangle+|1\rangle)/\sqrt2$ and $|1\rangle$ to $(|0\rangle-|1\rangle)/\sqrt2$).
The most serious error, or misleading statement, in the video is the implication that the "quantum eraser" erases which-path information, allowing the wave function to interfere again (13:04). That is not possible in quantum mechanics. There is nothing you can do to the qubit (or anything in the universe other than the photon), at any time, that will affect the observable behavior of the photon.
Analysis
Just after the "detection", the system is in the state $$|0\rangle|\text{photon in left slit}\rangle + |1\rangle|\text{photon in right slit}\rangle$$ (times $1/\sqrt2$; I'm going to ignore normalization factors for the most part). The photon then propagates to the screen. Just before the photon hits the screen, the state of the system is $$\sum_P (\alpha_P |0\rangle|\text{photon at P}\rangle + \beta_P |1\rangle|\text{photon at P}\rangle) = \sum_P (\alpha_P |0\rangle + \beta_P |1\rangle) |\text{photon at P}\rangle$$ where the sum is over all the points on the screen (all the pixels of the CCD, if you like).
If we had not put a detector at the slit, that sum would have been $$\sum_P (\alpha_P |\text{photon at P}\rangle + \beta_P |\text{photon at P}\rangle) = \sum_P (\alpha_P + \beta_P) |\text{photon at P}\rangle,$$ and the probability of finding the photon at P would therefore have been $|\alpha_P + \beta_P|^2$, which depends on the relative phase of $\alpha_P$ and $\beta_P$ (whose phases depend on the distances to the left and right slits respectively). But with the detector at the slit, the states scaled by $\alpha_P$ and $\beta_P$ are orthogonal, and so the probability of finding the photon at P is $|\alpha_P|^2 + |\beta_P|^2$, which does not depend on the relative phase. This is why the interference pattern disappears. In effect, there is interference only between identical states of the whole world, not just identical states of the photon we're detecting.
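Written out, the two cases differ only by a cross term. This is the standard expansion; $\phi_P$ is a symbol introduced here for the relative phase, $\phi_P = \arg\beta_P - \arg\alpha_P$. Without the detector, $$|\alpha_P + \beta_P|^2 = |\alpha_P|^2 + |\beta_P|^2 + 2\,\mathrm{Re}(\alpha_P^*\beta_P) = |\alpha_P|^2 + |\beta_P|^2 + 2|\alpha_P||\beta_P|\cos\phi_P,$$ whereas with the detector, $$\big\|\alpha_P|0\rangle + \beta_P|1\rangle\big\|^2 = |\alpha_P|^2 + |\beta_P|^2 + 2\,\mathrm{Re}(\alpha_P^*\beta_P\langle 0|1\rangle) = |\alpha_P|^2 + |\beta_P|^2,$$ because $\langle 0|1\rangle = 0$.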
Suppose the photon is actually detected at P (which is now a specific point, not a bound variable of the summation). The state of the system (after wavefunction collapse) is now $\alpha_P |0\rangle + \beta_P |1\rangle$ times an overall normalization factor that we don't care about.
At this point we have a choice (the delayed choice): measure the qubit in the {0,1} basis, or apply the "quantum eraser" (Hadamard gate) and then measure the qubit in the {0,1} basis. In the former case, we'll get 0 and 1 with relative probabilities $|\alpha_P|^2$ and $|\beta_P|^2$ respectively. In the latter case, we'll get 0 and 1 with relative probabilities $|\alpha_P+\beta_P|^2$ and $|\alpha_P-\beta_P|^2$ respectively. (They're relative because of the normalization factors that I ignored.)
Now think about what happens if we condition (in the ordinary classical sense) on getting 0 or 1 in this measurement. If we didn't apply the Hadamard gate, $|\alpha_P|$ and $|\beta_P|$ just fall off monotonically with increasing distance from the respective slits, and they will be roughly equal, so we'll get 0 for around half the dots and 1 for the other half, across the board. If we did apply the Hadamard gate, $|\alpha_P \pm \beta_P|$ oscillate at the frequency of the interference pattern we would have gotten in the standard double-slit experiment. If P is near one of the peaks of $|\alpha_P+\beta_P|$, we'll almost certainly get 0; if it's near one of the troughs, we'll almost certainly get 1. So conditioning on Hadamard 0 or 1, we'll get an interference pattern.
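The two cases of the delayed choice can be checked numerically with a small sketch. It assumes, purely for illustration, equal-magnitude amplitudes $\alpha_P = 1$ and $\beta_P = e^{i\phi}$; the helper names `qubit_probs` and `slit_amplitudes` are hypothetical.

```python
import cmath
import math

def qubit_probs(alpha, beta, hadamard):
    """Probabilities of reading 0 and 1 from the qubit alpha|0> + beta|1>."""
    if hadamard:
        # H maps |0> -> (|0> + |1>)/sqrt(2) and |1> -> (|0> - |1>)/sqrt(2),
        # so the amplitudes become (alpha + beta) and (alpha - beta) over sqrt(2).
        alpha, beta = (alpha + beta) / math.sqrt(2), (alpha - beta) / math.sqrt(2)
    p0, p1 = abs(alpha) ** 2, abs(beta) ** 2
    norm = p0 + p1  # restores the normalisation ignored in the text
    return p0 / norm, p1 / norm

def slit_amplitudes(phase):
    """Assumed amplitudes at a screen point with relative phase `phase`."""
    return 1.0 + 0j, cmath.exp(1j * phase)

for phase in (0.0, math.pi / 2, math.pi):
    a, b = slit_amplitudes(phase)
    print("phase", round(phase, 3),
          "no eraser:", qubit_probs(a, b, hadamard=False),
          "eraser:", qubit_probs(a, b, hadamard=True))
```

Without the Hadamard gate the outcome is an even split regardless of the phase; with it, the outcome tracks the phase, which is what produces fringes once you condition on the result.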
Why the people who think this shows backward causation are wrong
You have a bag containing 4 balls, 2 red and 2 black. You draw a ball. There's a 1/2 chance it will be red. If it is red, there's a 1/3 chance that the second ball you draw will be red.
But if you don't look at the first ball, there's a 1/2 chance that the second ball you draw will be red, and if it is, there's a 1/3 chance that the first ball you drew was red. If you collect data over many trials, conditioned on the second ball being red, you'll find that indeed about 1/3 of the first balls you drew were red.
Backward causation? Of course not. If X is correlated with Y, then Y is correlated with X. It doesn't matter whether Y happened after X.
The argument for backward causation in the delayed-choice quantum eraser experiment is exactly the same as the argument for backward causation in this classical experiment.
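The numbers in the ball example can be verified by brute-force enumeration. The sketch below lists all equally likely ordered draws of two balls and computes the conditional probabilities exactly; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

BALLS = ("R", "R", "B", "B")
# All 12 equally likely ordered draws of two distinct balls.
DRAWS = [(BALLS[i], BALLS[j]) for i, j in permutations(range(len(BALLS)), 2)]

def prob(event, given=None):
    """Exact probability of `event`, optionally conditioned on `given`."""
    pool = [d for d in DRAWS if given is None or given(d)]
    hits = sum(1 for d in pool if event(d))
    return Fraction(hits, len(pool))

def first_red(draw):
    return draw[0] == "R"

def second_red(draw):
    return draw[1] == "R"

print("P(first red)              =", prob(first_red))
print("P(second red | first red) =", prob(second_red, given=first_red))
print("P(second red)             =", prob(second_red))
print("P(first red | second red) =", prob(first_red, given=second_red))
```

The conditional probability is the same in both directions, which is just the symmetry of correlation; nothing in the calculation refers to the order in which the balls are looked at.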
Why this is a less interesting version of Bell's experiment
The initial state ($|0\rangle|\text{photon in left slit}\rangle + |1\rangle|\text{photon in right slit}\rangle$) is just a Bell state ($|0_A\rangle|0_B\rangle + |1_A\rangle|1_B\rangle$). There are three major differences between this experiment and the Bell experiment:
1. The measurement axes range over the whole Bloch sphere (linear/elliptical/circular polarization for photons, or any spin axis for electrons), while in the Bell experiment they're restricted to a plane (linear polarization for photons, or a spin axis in some plane for electrons).
2. The universe chooses the orientation of one of the detectors. (The dot on the screen encodes both the orientation and the result of the measurement, in this analogy.) For the other measurement we choose between two orientations (eraser or no eraser).
3. The measurements of the two halves of the Bell pair are timelike separated, not spacelike separated.
The first difference doesn't matter much; I think you can still derive a version of the Bell inequality with this change. (If you can't, that's just another reason why this experiment is less interesting.)
The second and third differences make the experiment much less interesting, because they each independently make the whole thing consistent with a local hidden variable theory.
You can think of Bell's experiment as a game show similar to The Newlywed Game: the contestants are allowed to talk and agree on a strategy, then they're separated and independently asked questions that they didn't know in advance. Their goal is to give answers that are correlated in a certain sense.
If they're allowed to choose the question asked of contestant A, they can win easily, since they can agree on contestant A's question and answer in advance, and contestant B then has all relevant information when deciding how to answer. Likewise, if B's question is in the future light cone of A's question and answer, they can win if A surreptitiously sends that information to B by radio or dark matter or other light-speed-limited means.
You could replace my ball experiment with a classical simulation of the delayed-choice quantum eraser. For example: a computer generates two random numbers $x,y \in [0,1]$ and prints out two copies which are given to two oracles. The first oracle simply tells you the value of $x$. The second oracle, on your command, either tells you whether $y < \tfrac12$ or whether $y < \sin^2 100x$. If you condition on the answer to the first question, you get a flat histogram of $x$ values, while if you condition on the answer to the second question, you get a sinusoidal histogram. With minor tweaks to the second oracle's formulas, this simulation becomes exact.
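A minimal sketch of the two-oracle simulation, using the untweaked formulas exactly as stated above. For brevity it evaluates both of the second oracle's possible answers on the same samples, whereas in the described setup you would ask only one question per trial; that shortcut, the bin width and the function names are assumptions of this sketch.

```python
import math
import random

def oracle_histograms(n_trials=200000, seed=0):
    """Histogram x, keeping only trials where the second oracle says 'yes'."""
    rng = random.Random(seed)
    bin_width = math.pi / 400    # a quarter of the fringe period pi/100
    n_bins = int(1 / bin_width)  # only bins that fit entirely inside [0, 1)
    flat = [0] * n_bins          # conditioned on y < 1/2
    fringes = [0] * n_bins       # conditioned on y < sin^2(100 x)
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        b = int(x / bin_width)
        if b >= n_bins:
            continue             # drop the partial bin at the right edge
        if y < 0.5:
            flat[b] += 1
        if y < math.sin(100 * x) ** 2:
            fringes[b] += 1
    return flat, fringes

def contrast(hist):
    """Fringe visibility (max - min) / (max + min) of a histogram."""
    return (max(hist) - min(hist)) / (max(hist) + min(hist))

flat, fringes = oracle_histograms()
print("contrast, 'y < 1/2' question:        ", round(contrast(flat), 3))
print("contrast, 'y < sin^2 100x' question: ", round(contrast(fringes), 3))
```

Everything here is classical bookkeeping on pre-printed random numbers, yet conditioning on the second question still produces a sinusoidal histogram of $x$ values.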
Bell wrote an essay called "Bertlmann's Socks and the Nature of Reality" in which he presented a thought experiment similar to my four-ball experiment (involving his colleague Dr. Bertlmann who always wears mismatched socks), as an example of what the EPR experiment is not about. I think the people who invented the delayed-choice quantum eraser didn't understand the difference between quantum mechanics and Bertlmann's socks.