It all boils down to how much information is available, in principle, to distinguish which way the photon went from the final state of the beam splitter. That information is encoded in the overlap between the beam splitter's two possible final states. The interference is destroyed because the photon gets entangled with the beam splitter, and the amount of entanglement depends on this overlap.
Say, then, that if the photon goes straight through the beam splitter, to state $|{\to}\rangle$, the beam splitter stays put, in state $|0\rangle$, whereas if the photon gets deflected into state $|{\downarrow}\rangle$, the beam splitter picks up some upwards momentum, $|{\Uparrow}\rangle$. If the result is a superposition, then the total state of the system is entangled (dropping the overall normalization):
$$|\Psi\rangle=|\to\rangle|0\rangle + |{\downarrow}\rangle|{\Uparrow}\rangle .$$
Regardless of what you do to the beam splitter - i.e. measure its state or just forget about it - in the absence of a measurement that introduces further interactions, the information you have available to produce an interference pattern on the photon side is given by the reduced density matrix obtained by taking the partial trace over the beam splitter.
Calculating this object is fairly simple. In the $\{|{\to}\rangle, |{\downarrow}\rangle\}$ basis, it is given by
$$
\mathrm{Tr}_{\mathrm{BS}}(|\Psi\rangle\langle\Psi|)
=
\begin{pmatrix}
1 & \langle{\Uparrow}|0\rangle \\ \langle0|{\Uparrow}\rangle & 1
\end{pmatrix}.
$$
If the beam splitter states are completely distinguishable, then they are orthogonal and what you get on the photon side is a completely mixed state, $|{\to}\rangle\langle{\to}|+|{\downarrow}\rangle\langle{\downarrow}|$, which is completely classical, and from which no interference can be extracted. Note that this happens regardless of whether you actually measure the beam splitter's momentum or not.
If there is no effect on the beam splitter, on the other hand, the states are the same, and the photon's density matrix corresponds to a pure state, $\left(|{\to}\rangle+|{\downarrow}\rangle\right)\left(\langle{\to}|+\langle{\downarrow}|\right)$. Then you will see complete interference, but you will have no "which way" information available, even in principle.
In any physical realization, of course, you're somewhere in the middle. Most realizations have very similar final states for the beam splitter, which means that $\langle{\Uparrow}|0\rangle$ is very close to 1, and you get good interference. As the states become more distinguishable, the contrast in the interference fringes, which is set by $|\langle{\Uparrow}|0\rangle|$, is reduced.
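To make the role of the overlap concrete, here is a minimal numerical sketch of the partial trace. It is only an illustration under stated assumptions: the beam splitter is modelled as a two-dimensional system, the overlap is taken to be real, and the function names `reduced_photon_state` and `fringe_visibility` are made up for this example. The code also includes the $1/\sqrt{2}$ normalization factor, so its matrix is one half of the one written above.

```python
import numpy as np

def reduced_photon_state(overlap):
    """Photon density matrix after tracing out the beam splitter.

    `overlap` is the overlap of the two beam-splitter states,
    assumed real and between 0 and 1 for this sketch.
    """
    # Toy two-dimensional model of the beam splitter (an assumption):
    # |0> = (1, 0) and |Up> = (overlap, sqrt(1 - overlap^2)).
    bs_rest = np.array([1.0, 0.0])
    bs_kick = np.array([overlap, np.sqrt(1.0 - overlap**2)])

    # Photon path states: |right> goes straight through, |down> is deflected.
    photon_right = np.array([1.0, 0.0])
    photon_down = np.array([0.0, 1.0])

    # Normalized entangled state |right>|0> + |down>|Up>.
    psi = (np.kron(photon_right, bs_rest)
           + np.kron(photon_down, bs_kick)) / np.sqrt(2.0)

    # Full density matrix with indices (photon, bs, photon', bs').
    rho = np.outer(psi, psi.conj()).reshape(2, 2, 2, 2)

    # Partial trace over the beam splitter: sum over bs = bs'.
    return np.einsum('ibjb->ij', rho)

def fringe_visibility(rho):
    """Fringe contrast for equal path populations: 2 |rho_01|."""
    return 2.0 * abs(rho[0, 1])

if __name__ == "__main__":
    for overlap in (0.0, 0.6, 1.0):
        rho = reduced_photon_state(overlap)
        print(overlap, fringe_visibility(rho))
```

In this toy model the visibility simply equals the overlap: orthogonal beam splitter states give no fringes, and identical states give full contrast.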
I understand this can feel pretty thin. After all, how are we to know that we've eliminated all possible places where "which way" information may in principle be available? In fact, this is exactly the problem you face in the lab, and it is the reason that observing things like Mandel dips is very, very touchy: if you want two photons to interfere, you need to make sure that they truly are indistinguishable - in spatial profile, displacement, spectrum, and timing - for otherwise there will be (possibly undetected) entanglement with some other mode, and that will reduce or destroy your interference contrast.
The answer is no, you don't get any interference rings if both arms of the Michelson are identical and the beamsplitter is perfectly flat.
The answer with real optics is somewhat more complicated. Real laser beams have curved wavefronts (as opposed to plane ones), and the curvature changes as the beams propagate. If you take two laser beams of the same frequency but different curvatures and overlap them on a camera, then you will get interference rings like those shown in your Wikipedia diagram. Two laser beams of the same curvature give an interference pattern with no rings.
In a Michelson interferometer the things which could cause the two returning beams to have different curvatures are numerous but the most common ones are: path length differences between the arms, curvature of the two end mirrors, or curvature of the beamsplitter.
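The ring pattern from mismatched curvatures can be sketched numerically. This is a rough illustration rather than a model of any real setup: it assumes two equal-amplitude beams with paraxial spherical wavefronts (phase $k r^2 / 2R$ at radius $r$ from the axis), ignores the Gaussian intensity envelope, and uses a made-up function name, `overlap_intensity`.

```python
import numpy as np

def overlap_intensity(r, wavelength, radius_1, radius_2, phase_offset=0.0):
    """Intensity of two overlapped unit-amplitude beams versus radius r.

    Assumes paraxial spherical wavefronts with radii of curvature
    radius_1 and radius_2 (all lengths in metres) and no envelope.
    """
    k = 2.0 * np.pi / wavelength
    # Phase difference between the two wavefronts at distance r from the axis.
    delta_phase = (0.5 * k * r**2 * (1.0 / radius_1 - 1.0 / radius_2)
                   + phase_offset)
    # |1 + exp(i * delta_phase)|^2 = 2 + 2 cos(delta_phase)
    return 2.0 + 2.0 * np.cos(delta_phase)

if __name__ == "__main__":
    r = np.linspace(0.0, 2e-3, 500)  # radial positions on the camera
    matched = overlap_intensity(r, 633e-9, 1.0, 1.0)
    mismatched = overlap_intensity(r, 633e-9, 1.0, 0.5)
    print("matched curvature, intensity range:", matched.min(), matched.max())
    print("mismatched curvature, intensity range:", mismatched.min(), mismatched.max())
```

With equal curvatures the phase difference does not depend on $r$, so the intensity is uniform. With different curvatures it grows as $r^2$, which is what produces concentric rings.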
Best Answer
Have a look at my answer to "Slit screen and wave-particle duality", because it covers a lot of topics relevant to your question.
You're correct that if we imagine the photon as a little ball then if the arms of the interferometer are different lengths the two "halves" of the little ball cannot arrive at the detector at the same time. But this is not how the interference works. It does not work by the photon splitting and its two halves then interfering with each other.
To imagine the light as a little ball (the photon) ricocheting around your interferometer is highly misleading. The behaviour of the light is best explained by quantum field theory, but sticking to regular quantum mechanics we'd have to say that until we interact with the light (e.g. by it hitting a CCD or photographic plate) the light is delocalised over your whole interferometer.
This doesn't mean the photon has a position that we just don't know; it means the light simply has no position in the classical sense. The wavefunction that describes it covers your whole experimental equipment. The probability of detecting the photon at some point in your kit is given by the squared magnitude of its wavefunction at that point. If you change the geometry of your interferometer then you will change the wavefunction, and therefore change the probability of detecting the photon at any particular point.
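As a simple illustration of how the geometry sets the detection probability, here is a sketch for an idealized Michelson interferometer. The assumptions are mine, not part of the argument above: perfectly monochromatic light, a lossless 50:50 beam splitter, perfectly overlapped plane wavefronts, and a hypothetical function name `detection_probabilities`. Under those assumptions an arm length difference $\Delta L$ gives a round-trip phase $2k\Delta L$ between the two paths.

```python
import numpy as np

def detection_probabilities(arm_difference, wavelength):
    """Probabilities of detecting the photon at the two output ports.

    Idealized model: monochromatic light, lossless 50:50 beam splitter.
    arm_difference is the difference in arm lengths (metres); the light
    travels each arm twice, so the path difference is twice that.
    """
    k = 2.0 * np.pi / wavelength
    phase = 2.0 * k * arm_difference
    p_detector = 0.5 * (1.0 + np.cos(phase))  # port facing the detector
    p_source = 1.0 - p_detector               # port leading back to the source
    return p_detector, p_source

if __name__ == "__main__":
    wavelength = 633e-9
    for fraction in (0.0, 0.125, 0.25):
        print(fraction, detection_probabilities(fraction * wavelength, wavelength))
```

Changing the arm length by a quarter of a wavelength moves the detection probability at the detector port from one to zero, even though nothing "splits" and nothing needs to arrive at the same time.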