a) What would happen if we did NOT detect interference at D0 8 ns before the entangled idler photons reach D3 or D4, and then decided to remove the BSa and BSb beam splitters really, really fast so that the idler photons travel to either D1 or D2 instead? The "which path" would then not be known, so we should actually see an interference pattern at D0.
Looking at the data at D0 alone, you never see an interference pattern. Photons come through the initial double slit at a particular rate. Each one that comes through undergoes spontaneous parametric down-conversion, so we have a pair of photons. When one half of the pair goes to D0, we detect it. Wherever it lands, it lands. You have to collect many results before you get any pattern at all, so you can't say "no interference" on the basis of a few hits. An interference pattern is really a frequency histogram, and if you have two histograms with the troughs of one aligned with the peaks of the other, the combined aggregate histogram doesn't have peaks and troughs. So you can't wait until you see no pattern and then remove the beam splitters.
The pattern comes from labeling each hit at D0 with a time and then later sorting the hits into groups, with the peaks of one group on top of the troughs of another. So the "interference pattern" comes later. Even without the additional beam splitters it comes later, because R1 (coincidence with D1) and R2 (coincidence with D2) label the original D0 collection into two distinct groups. Imagine you see a pattern that doesn't look like an interference pattern, and then 8 ns after each hit you get information that labels that dot with a happy face or a sad face. The happy-face distribution develops peaks and troughs, the sad-face distribution develops peaks and troughs, and the troughs of one are the peaks of the other and vice versa.
Removing the additional beam splitters just means you have two groups to sort the results into instead of four. You don't see an interference pattern in the aggregate results at D0; you only see it in the two histograms after you sort the results. And you don't know which group any particular result will be sorted into until 8 ns later, when you detect at D1 or D2.
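As a rough numerical illustration of the sorting argument above, here is a minimal Python sketch. The Gaussian envelope, the cos²/sin² fringe shapes, and the constants `K` and `WIDTH` are assumptions chosen for illustration; they are not values from the experiment.

```python
import math

K = 2.0      # assumed fringe wavenumber (arbitrary units)
WIDTH = 4.0  # assumed envelope width (arbitrary units)

def envelope(x):
    """Smooth, fringe-free profile of all D0 hits (assumed Gaussian)."""
    return math.exp(-(x / WIDTH) ** 2)

def r1(x):
    """Relative rate of D0 hits that are later labeled by a D1 click."""
    return envelope(x) * math.cos(K * x) ** 2

def r2(x):
    """Relative rate of D0 hits that are later labeled by a D2 click."""
    return envelope(x) * math.sin(K * x) ** 2

# Sample positions across the D0 scan range.
xs = [i * 0.1 for i in range(-40, 41)]

# Each labeled group has fringes, but their sum is just the envelope.
for x in xs[::8]:
    print(f"x={x:+.1f}  R1={r1(x):.3f}  R2={r2(x):.3f}  sum={r1(x) + r2(x):.3f}")
```

The two labeled groups each show peaks and troughs, while the `sum` column follows the smooth envelope, which is all you can see at D0 before the labels arrive.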
b) If we moved either mirror Mb or Ma just a tiny bit from its position, so that the red or blue path for the photon would differ a bit in length, wouldn't we then be able to tell, from the time it took to reach detector D2 or D1, which of the two slits it came from?
Firstly, there is some leeway: a beam that hasn't been on forever isn't perfectly monochromatic, so there is some room to move the mirrors a bit. Secondly, if you move the mirrors, D1 and D2 will fire at different rates, so you will now sort the results at D0 into two unequal groups. The peaks and troughs of the two subgroups no longer line up perfectly, and the larger group looks less and less like an interference pattern, until at some magic distance only one of D1 or D2 goes off (let's say only D1). You are then sorting the results of D0 into just one group.
c) What if we replaced D0 with another double slit, with the red and blue paths each pointed at one of the two slits? In the case of the idler photon reaching D3/D4, would the signal photon choose exactly one of the slits, and hence not interfere with itself?
Short answer: you are correct. However, the wave function evolves in multiparticle configuration space; it is not a field in three-dimensional space. And this already happens without adding a second double slit. That's why the histograms of R3 and R4 don't have peaks and troughs (well, each has one peak and no trough, unless you count as a trough at infinity the fact that they are concentrated in a finite region). So a second double slit is irrelevant to R3 and R4.
In case you meant to use a second double slit for something else, I'll go into more detail about what, if anything, it would do.
A double slit is not magic, and it only works in a very particular way in very particular situations. For instance, the original double slit has a laser wavefront coming into it, so the waves coming out of each slit are in phase with each other. Furthermore, the wave is for just one particle and there is no entanglement. Those facts serve to determine exactly where the peaks will be if you place a screen in front of it.
The light heading over to D0 is very different from monochromatic, in-phase, plane-wave laser light. It is entangled light: the parametric down-conversion produces entangled light, so the two red beams coming out of the SPDC region are entangled with each other. It's like there is a superposition of two particles (one traveling along each of those red beams). So each pair of beams coming out of the SPDC region is a superposition of states of different polarization. But worse than that, they are entangled, so by themselves they don't individually have the properties associated with the entanglement.
The red and blue beams could be deflected so that their propagation vectors are orthogonal to a screen with holes, and directed towards the holes. If the holes are large compared to the beam widths, it's like not having a screen with holes at all. If the holes are small, D0 will fire less often, because the screen absorbs some photons. So you can reproduce those aspects of a standard double-slit setup.
But the two beams are not arriving in phase, and each is really an entangled superposition of different polarizations. So you can't expect a double slit there to work exactly the same as it would in a normal double-slit setup.
Now, normally in quantum mechanics you can track the lines of probability current and even write dynamical equations for them. If you do that, you see that absorbing the edges of a beam makes the surviving (new) edge flare out more.
So the new double slit will flare the beams more. The troughs weren't identically zero, since the original slits were finite-sized as well as finitely spaced (each beam had some thickness). But farther out from the central region of D0, the beams from the new double slit are now more spread out, so you should detect more peaks and troughs. They arise from the difference in path length between red and blue. There are multiple subgroups with different locations for their peaks and troughs, because entanglement with the other beams means the red and blue beams don't have a constant phase difference.
Therefore, if the signal photon hits an area of the screen that it could not possibly hit when interfering with itself (a gap in an interference pattern), would we know for certain, 8 ns beforehand and with just a single photon pair (signal/idler), that the which-path is known in this special case?
The frequency histogram at D0 is the sum of the histograms for R1, R2, R3, and R4. R3 and R4 have one central peak each, offset from each other since the red and blue beams aim for different places. R1 and R2 each have their peaks in the other's troughs.
When you see a hit in the trough of R1 you now know it is much more likely that D1 does not go off 8ns later.
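The update described above is an ordinary conditional probability: the chance of each idler detector firing, given where the signal photon landed, is that detector's histogram divided by the sum of all four. A minimal sketch follows, assuming idealized histogram shapes, equal overall weights for the four detectors, and made-up constants `K`, `WIDTH`, and `OFFSET` (none of these come from the experiment).

```python
import math

K = 2.0       # assumed fringe wavenumber (arbitrary units)
WIDTH = 4.0   # assumed envelope width
OFFSET = 0.5  # assumed offset between the R3 and R4 peaks

def rates(x):
    """Idealized relative rates R1..R4 at D0 position x (illustrative shapes)."""
    env = math.exp(-(x / WIDTH) ** 2)
    return {
        "D1": env * math.cos(K * x) ** 2,               # fringes
        "D2": env * math.sin(K * x) ** 2,               # complementary fringes
        "D3": math.exp(-((x - OFFSET) / WIDTH) ** 2),   # single peak
        "D4": math.exp(-((x + OFFSET) / WIDTH) ** 2),   # single peak, offset
    }

def posterior(x):
    """P(detector fires 8 ns later | signal photon landed at x)."""
    r = rates(x)
    total = sum(r.values())
    return {name: value / total for name, value in r.items()}

peak_of_r1 = 0.0
trough_of_r1 = math.pi / (2 * K)
for label, x in (("R1 peak", peak_of_r1), ("R1 trough", trough_of_r1)):
    probs = posterior(x)
    print(label, {name: round(p, 3) for name, p in probs.items()})
```

In this idealized model the R1 troughs are exactly zero, so a hit there rules D1 out completely. With real, finite-width slits the troughs are only shallow minima, and a hit there makes D1 much less likely rather than impossible.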
This question was cross-posted to physics forums word for word. I'll give the same basic answer I gave there.
Consciousness is never part of any quantum mechanical explanation. Every experiment runs the same whether or not a person is in the room.
Retrocausality is also not required here. For example, the Copenhagen interpretation explains the delayed choice eraser with instantaneous non-local partial collapse and the many worlds interpretation explains it with worlds staying coherent and interfering. Those are the two most popular interpretations.
Thinking of the delayed choice eraser in terms of an optical experiment muddles the issue, in my opinion. We can create the same basic effect with a much simpler system, involving three qubits.
Analogous Simpler Situation
Suppose you have the state $\psi = \frac{1}{2} \left|000\right\rangle + \frac{1}{2} \left|110\right\rangle + \frac{1}{2} \left|011\right\rangle + \frac{1}{2} \left|101\right\rangle$. That is to say: you have three qubits, the first two qubits are each initialized into the half-and-half state $\frac{1}{\sqrt{2}} \left|0\right\rangle + \frac{1}{\sqrt{2}} \left|1\right\rangle$, and then the third qubit is conditionally toggled so that its value tells you whether the first two qubits differ or not.
Now, run some Bell tests with the first two qubits. You'll find that they don't violate any Bell inequalities, and fail any other test of entanglement. They aren't entangled.
But, if you later measure the third qubit, and split the tests you did on the other two qubits into a "third qubit was 0" group and a "third qubit was 1" group, you'll see that within each group there are Bell inequalities being violated! So the first two qubits were entangled all along.
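To make the grouping explicit, the same state can be regrouped by the value of the third qubit (this is only algebra on the state as given):

$$\psi = \frac{1}{\sqrt{2}} \left( \frac{\left|00\right\rangle + \left|11\right\rangle}{\sqrt{2}} \otimes \left|0\right\rangle + \frac{\left|01\right\rangle + \left|10\right\rangle}{\sqrt{2}} \otimes \left|1\right\rangle \right)$$

Each branch is a maximally entangled Bell state of the first two qubits, and the third qubit records which branch you are in. Ignoring the third qubit leaves an equal mixture of the two branches, and that mixture can equally be written as an equal mixture of the product states $\left|{+}{+}\right\rangle$ and $\left|{-}{-}\right\rangle$, which is why the ungrouped data pass no entanglement test.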
BUT, if you measure the third qubit along the X axis instead of the Z axis we've been working with, then you'll never be able to split the two groups apart and see the entangled sub-cases. The distinguishing information becomes permanently inaccessible, unrecoverable due to thermodynamics stopping you from reverting the measurement.
So which is it? Were they entangled? Not entangled? Only entangled when we made the right measurement? I would say that they are entangled, but in an unusual way that's harder to detect. The third qubit tells you what type of entanglement exists between the first two qubits (entangled to agree, or entangled to disagree). Each subcase is entangled, but the cases are complementary in a way that hides any signal of entanglement if you count them together instead of individually.
Whether or not we choose to measure the correct axis of the third qubit doesn't determine whether the original two qubits are entangled or not, it determines whether we have the information needed to split the results into the two complementary sub-cases. If you try to simplify the situation into just "two-particles-maximally-entangled" vs "not-entangled", or into "is-just-a-particle" vs "is-just-a-wave", you're throwing away the context needed to understand what's going on.
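The three-qubit story can be checked numerically. The sketch below is my own illustration, not part of any library: it conditions the first two qubits on a Z-basis or X-basis result for the third qubit and evaluates a CHSH value for one family of settings (first qubit measured along Z or X, second along the diagonals $(\pm Z \pm X)/\sqrt{2}$). For that family the CHSH value reduces to $\sqrt{2}\,(|\langle ZZ\rangle| + |\langle XX\rangle|)$, and anything above 2 is a violation. The helper names are hypothetical.

```python
import math

# Amplitudes of psi, keyed by the bit string q1 q2 q3 (q3 is the third qubit).
PSI = {"000": 0.5, "110": 0.5, "011": 0.5, "101": 0.5}

def condition_on_third(basis, outcome):
    """Return (probability, normalized 2-qubit amplitudes) given the third
    qubit's result. basis is "Z" or "X"; outcome is 0 or 1."""
    vec = {}
    for a in "01":
        for b in "01":
            amp0 = PSI.get(a + b + "0", 0.0)
            amp1 = PSI.get(a + b + "1", 0.0)
            if basis == "Z":
                vec[a + b] = amp1 if outcome else amp0
            else:  # X basis: project onto (|0> + |1>) or (|0> - |1>), normalized
                sign = -1.0 if outcome else 1.0
                vec[a + b] = (amp0 + sign * amp1) / math.sqrt(2)
    prob = sum(v * v for v in vec.values())
    norm = math.sqrt(prob)
    return prob, {k: v / norm for k, v in vec.items()}

def flip(bits):
    """Apply X to every qubit of a bit string."""
    return "".join("1" if c == "0" else "0" for c in bits)

def zz(vec):
    """<Z Z> for a real-amplitude two-qubit state."""
    return sum(v * v * (1 if k[0] == k[1] else -1) for k, v in vec.items())

def xx(vec):
    """<X X> for a real-amplitude two-qubit state."""
    return sum(v * vec[flip(k)] for k, v in vec.items())

def chsh(e_zz, e_xx):
    """CHSH value for the settings described above. The absolute values
    correspond to relabeling outcomes in post-processing."""
    return math.sqrt(2) * (abs(e_zz) + abs(e_xx))

def summarize(basis):
    """CHSH value for each third-qubit outcome group and for the pooled data."""
    groups = [condition_on_third(basis, m) for m in (0, 1)]
    result = {m: chsh(zz(vec), xx(vec)) for m, (_, vec) in enumerate(groups)}
    pooled_zz = sum(p * zz(vec) for p, vec in groups)
    pooled_xx = sum(p * xx(vec) for p, vec in groups)
    result["pooled"] = chsh(pooled_zz, pooled_xx)
    return result

for basis in "ZX":
    print(basis, {k: round(v, 3) for k, v in summarize(basis).items()})
```

Grouping by the Z-basis result gives a CHSH value above 2 in each group, while the pooled data stay below 2. Grouping by the X-basis result leaves both groups below 2 as well, because each group is then a product state.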
Mapping Back
The exact same logic applies to the delayed choice quantum eraser experiment, except there's an extra value involved and you're looking for interference patterns instead of passing Bell tests. No consciousness. No retrocausality. Just "did we get and use the distinguishing information needed to group the lack-of-interference pattern into two complementary interference patterns".
Best Answer
The short answer is that you're right. The delayed choice experiment doesn't require backwards-in-time shenanigans, and all the pop-science articles implying this to be the case are basically garbage.
For example, here is the example delayed choice eraser circuit from my quantum circuit simulator Quirk:
The green boxes are state displays that show the probability of each possible measurement result, optionally conditioned on some other qubits' possible measurement results.
The top wire is the "choice" qubit. The second wire is the "which slit" qubit. The rest of the wires are the "where did it hit on the screen" qureg. The first two operations set up some entanglement between the which-slit qubit and the screen qureg.
The four displays on the right show that, if you group the screen measurements by the choice qubit and the which-slit qubit, then within the groupings you will see an interference pattern if and only if the choice qubit is ON.
But correlation goes both ways. Instead of thinking about how the choice qubit and which-slit qubits predict the screen measurement, we can think about how the screen measurement predicts the which-slit qubit's state.
It would be tedious to set up $2^7$ Bloch sphere displays, each conditioned on a different screen measurement. Instead, let's use a single condition but cycle through offsets to the screen measurement. This circuit makes it very clear that the landing position on the screen is correlated with different states of the which-slit qubit:
This diagram also answers your main question:
Yes. The value analogous to the D1-vs-D2-given-position likelihood is shown in that green box in the top right of the diagram. The chance is changing as we focus on different positions.
Notice that the qubit is spinning like crazy regardless of whether we applied the $X^{1/2}$ rotation that will be controlled by the choice qubit. All the $X^{1/2}$ rotation does is switch which axis the qubit is spinning around as I vary the screen measurement position being focused on. If there is no $X^{1/2}$ rotation, the spinning goes around the measurement axis, and so doesn't affect the probability of measuring ON-vs-OFF (i.e. we've picked a rather poor measurement axis). But if the $X^{1/2}$ is applied, then the spinning is around the Y axis instead of the Z axis, and so does translate into changes in computational-basis measurement probability.
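The effect of the $X^{1/2}$ rotation can be seen in a one-qubit toy model. The sketch below assumes the conditioned which-slit qubit sits on the equator of the Bloch sphere, $(\left|0\right\rangle + e^{i\phi}\left|1\right\rangle)/\sqrt{2}$, with a phase $\phi$ that stands in for the screen position being focused on. That parameterization is an assumption made for illustration and is not read off the Quirk circuit.

```python
import cmath
import math

# Matrix of the X^(1/2) gate (square root of NOT).
SQRT_X = [[(1 + 1j) / 2, (1 - 1j) / 2],
          [(1 - 1j) / 2, (1 + 1j) / 2]]

def equator_state(phi):
    """(|0> + e^{i phi}|1>)/sqrt(2): Bloch vector rotated by phi about Z."""
    return [1 / math.sqrt(2), cmath.exp(1j * phi) / math.sqrt(2)]

def apply(gate, vec):
    """Multiply a 2x2 gate into a single-qubit state vector."""
    return [gate[0][0] * vec[0] + gate[0][1] * vec[1],
            gate[1][0] * vec[0] + gate[1][1] * vec[1]]

def p_on(vec):
    """Probability of measuring ON (|1>) in the computational basis."""
    return abs(vec[1]) ** 2

for step in range(8):
    phi = step * math.pi / 4
    plain = p_on(equator_state(phi))
    rotated = p_on(apply(SQRT_X, equator_state(phi)))
    print(f"phi={phi:.2f}  P(ON) without gate={plain:.3f}  with X^(1/2)={rotated:.3f}")
```

Without the gate, the phase spins the state around the measurement axis and P(ON) never changes. With the gate, the same phase shows up as a varying P(ON), which is the position-dependent likelihood discussed above.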
So here is a forward-in-time collapse interpretation of the delayed choice experiment: