(1) The completeness relation for a basis of vectors orthonormal with respect to $\eta_{\mu\nu}$ is
\begin{equation}
\eta_{ij}\epsilon^{(i)}_\mu \epsilon^{(j)}_\nu = \eta_{\mu\nu}
\end{equation}
This normalization convention is chosen for Lorentz invariance. I know you said you didn't want that answer, but the point is that the normalization of these vectors is a matter of convention, and it's best to pick a Lorentz-invariant one. One advantage of a Lorentz-invariant normalization is that we don't need to specify the argument: the $\epsilon$ depend on the momentum, but these normalization conditions do not. The $\eta_{ij}$ provides the minus sign you are missing. Here you can also see the basic problem that the gauge symmetry fixes: one of the polarization vectors necessarily has a negative norm.
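As a quick numerical sanity check of the completeness relation, here is a minimal sketch. It assumes the signature $(+,-,-,-)$ and uses an illustrative basis $\epsilon^{(i)\mu}=\delta^\mu_i$ in one frame, then boosts it along $z$; the helper names `boost_z` and `completeness` are made up for this example.

```python
import numpy as np

# Metric with signature (+, -, -, -); this sign convention is an assumption.
eta = np.diag([1.0, -1.0, -1.0, -1.0])

def boost_z(rapidity):
    """Lorentz boost along z, acting on contravariant components."""
    L = np.eye(4)
    ch, sh = np.cosh(rapidity), np.sinh(rapidity)
    L[0, 0] = L[3, 3] = ch
    L[0, 3] = L[3, 0] = sh
    return L

def completeness(eps_lower):
    """Return eta_{ij} eps^{(i)}_mu eps^{(j)}_nu, with eps_lower[i, mu] = eps^{(i)}_mu."""
    return np.einsum("ij,im,jn->mn", eta, eps_lower, eps_lower)

# Illustrative basis: eps^{(i) mu} = delta^mu_i (row i is the i-th basis vector).
eps_upper = np.eye(4)
eps_lower = eps_upper @ eta              # lower the Lorentz index

# Boost every basis vector: eps'^mu = L^mu_nu eps^nu.
eps_upper_boosted = eps_upper @ boost_z(0.7).T
eps_lower_boosted = eps_upper_boosted @ eta

# Both bases should reproduce eta_{mu nu}.
print(np.allclose(completeness(eps_lower), eta))
print(np.allclose(completeness(eps_lower_boosted), eta))

# Norms eta^{mu nu} eps^{(i)}_mu eps^{(j)}_nu: the timelike vector has the
# opposite sign to the three spacelike ones.
norms = np.einsum("mn,im,jn->ij", eta, eps_lower_boosted, eps_lower_boosted)
print(np.round(np.diag(norms), 12))
```

The boosted vectors have different components, but the relation they satisfy is unchanged, which is the sense in which the normalization does not depend on the momentum.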
(2) Having said that, $\epsilon^{(0)}_\mu$ and $\epsilon^{(3)}_\mu$ are not valid on-shell quantities. They are a convenient mathematical fiction, needed to make an orthonormal basis, which allows things to be written in a nice, Lorentz-invariant way. But the external legs of Feynman diagrams must be on shell, so you can only put real, honest on-shell polarization vectors there; you aren't allowed to put $\epsilon^{(0,3)}$ there at all. Put another way, the longitudinal and timelike modes cannot satisfy the equations of motion for the photon, and the LSZ formula picks out the external wave functions that satisfy the classical equations of motion. However, since $k_\mu \mathcal{M}^\mu=0$, you can add $0$ in the funny combination $\left(\epsilon^{(0)}_\mu-\epsilon^{(3)}_\mu\right)\mathcal{M}^\mu$, which combines with the contributions of your other basis vectors to form $\eta_{\mu\nu}\mathcal{M}^{*\mu}\mathcal{M}^\nu$ when you square to get the probability. If the hypocrisy of this angers you, that is a natural reaction; you'll eventually just accept it. (Welcome to gauge theory.)
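To make the "adding zero" step concrete, here is a worked version in components. The conventions are assumptions not fixed in the text above: signature $(+,-,-,-)$, photon momentum along $z$ so that $k^\mu=(k,0,0,k)$, and transverse vectors $\epsilon^{(1)\mu}=(0,1,0,0)$, $\epsilon^{(2)\mu}=(0,0,1,0)$. The Ward identity then reads $k_\mu\mathcal{M}^\mu=k\left(\mathcal{M}^0-\mathcal{M}^3\right)=0$, so $|\mathcal{M}^0|^2=|\mathcal{M}^3|^2$ and

$$
\begin{aligned}
\sum_{\lambda=1,2}\left|\epsilon^{(\lambda)*}_\mu\mathcal{M}^\mu\right|^2
&= |\mathcal{M}^1|^2+|\mathcal{M}^2|^2 \\
&= |\mathcal{M}^1|^2+|\mathcal{M}^2|^2+|\mathcal{M}^3|^2-|\mathcal{M}^0|^2 \\
&= -\eta_{\mu\nu}\mathcal{M}^{*\mu}\mathcal{M}^\nu \,.
\end{aligned}
$$

With the opposite signature, the overall sign in the last line flips.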
(3) EXCELLENT question. You need the off-shell formulation of the Ward identity to give a real answer to this; that's in chapter 7 of P&S. Basically, there's more to it than just "replace the external polarization vector by $k_\mu$": you can really show that the parts of the propagator proportional to $k_\mu k_\nu$ never matter, even in loops. However, in Yang-Mills theories the corresponding statement is not true! So your question is exactly on the money for Yang-Mills theories: you get contributions in loops from the longitudinal and timelike modes, and by the optical theorem this, taken at face value, would lead to the production of unphysical particles. The fix is to add yet more unphysical particles to the theory to cancel these parts of the loop diagrams; they are called Faddeev-Popov ghosts.
After flipping through Peskin and Schroeder to answer this question, I have to say that they prove things in a very roundabout way. It's good in that it teaches you how to think about Feynman diagrams in a very detailed way, but there are other, less painful ways to prove and think about the Ward identity (such as using the path integral).
There are two related quantities we might want to compute here:
- The amplitude $\mathcal{A}$ for a particular process.
- The total decay rate of the initial particle.
When computing an amplitude, we must specify precisely which initial state we start with and which final state we end up with. So we can talk of the amplitude that a particle with momentum $p$ decays into two particles with momenta $q_1$ and $q_2$ and polarizations $\lambda_1$ and $\lambda_2$, but we can't talk of the amplitude that a given particle decays into some other particle.
However, we do want to know such things as the probability that a given particle decays into some other particle (in a given time) – this is the total decay rate. To find this, we consider all possible specific processes that would lead to such a decay, calculate the amplitude for each, square each of these amplitudes, and then sum the squares. This final summation is just the ordinary rule for adding probabilities of mutually exclusive events.
The amplitude for the process in which a particle with momentum $p$ decays into two particles with momenta $q_1$ and $q_2$ and polarizations $\lambda_1$ and $\lambda_2$ is, by your Feynman rules,
$$ \mathcal{A}(q_1,q_2,\lambda_1,\lambda_2) = i g_a \varepsilon_{\mu \nu \sigma \tau} q_1^\sigma q_2^\tau \epsilon^{(\lambda_1)\mu*}\epsilon^{(\lambda_2)\nu *} \,.$$
We now want to square this amplitude, and then sum over the possible momenta and polarizations of the final state. We also need to include a momentum-conserving delta function. In the rest frame of the initial particle, this fixes the 3-momenta of the final-state particles to be equal and opposite, and also fixes the magnitude of this momentum. We are hence left with a sum over only the direction of emission of one of the particles. The decay rate is then given by
$$ \Gamma \propto \sum_{\lambda_1}\sum_{\lambda_2} \int \mathrm{d}\Omega\,|\mathcal{A}(q_1,p-q_1,\lambda_1,\lambda_2)|^2 \,.$$
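The polarization sum in this formula can be checked numerically. The sketch below is not part of the original derivation and makes several assumptions: signature $(+,-,-,-)$, massless final-state particles emitted along $\pm z$ in the rest frame, linear transverse polarization vectors, and arbitrary illustrative values for $g_a$ and the parent mass $m$. It compares the explicit sum over transverse polarizations with the result of replacing each polarization sum by $-\eta^{\mu\nu}$, and with the closed form $g_a^2 m^4/2$ that follows from contracting the two Levi-Civita tensors.

```python
import itertools
import numpy as np

ETA = np.diag([1.0, -1.0, -1.0, -1.0])  # assumed signature (+, -, -, -)

def levi_civita_4():
    """Totally antisymmetric symbol with eps[0, 1, 2, 3] = +1."""
    eps = np.zeros((4, 4, 4, 4))
    for perm in itertools.permutations(range(4)):
        # Parity of the permutation from its number of inversions.
        inversions = sum(perm[a] > perm[b] for a in range(4) for b in range(a + 1, 4))
        eps[perm] = -1.0 if inversions % 2 else 1.0
    return eps

LEVI = levi_civita_4()

def vertex_tensor(q1, q2):
    """T_{mu nu} = eps_{mu nu sigma tau} q1^sigma q2^tau."""
    return np.einsum("mnst,s,t->mn", LEVI, q1, q2)

def amplitude(g, q1, q2, pol1, pol2):
    """A = i g T_{mu nu} conj(pol1^mu) conj(pol2^nu)."""
    T = vertex_tensor(q1, q2)
    return 1j * g * np.einsum("mn,m,n->", T, np.conj(pol1), np.conj(pol2))

def summed_explicit(g, q1, q2, pols):
    """Sum |A|^2 over the given polarization vectors for both particles."""
    return sum(abs(amplitude(g, q1, q2, e1, e2)) ** 2 for e1 in pols for e2 in pols)

def summed_covariant(g, q1, q2):
    """Same sum with each polarization sum replaced by -eta; the two signs cancel."""
    T = vertex_tensor(q1, q2)
    return g**2 * np.einsum("mn,ab,ma,nb->", T, np.conj(T), ETA, ETA).real

# Illustrative numbers (assumptions): coupling g and parent mass m.
g, m = 1.3, 2.0
E = m / 2
q1 = np.array([E, 0.0, 0.0, E])
q2 = np.array([E, 0.0, 0.0, -E])
pols = [np.array([0, 1, 0, 0], dtype=complex), np.array([0, 0, 1, 0], dtype=complex)]

explicit = summed_explicit(g, q1, q2, pols)
covariant = summed_covariant(g, q1, q2)
print(explicit, covariant, g**2 * m**4 / 2)
```

Because the summed square depends only on the invariant $q_1\cdot q_2$, it has no angular dependence, so the $\mathrm{d}\Omega$ integral just contributes a factor of $4\pi$.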
From here it should become clearer how to make use of the two hints given to you.
Best Answer
I just realized that the answer is stupidly simple. For two outgoing photons, computing $M^*$ gives you the epsilons without complex conjugation, so in the full $|M|^2$ you can use $\sum\epsilon_{\mu}\epsilon^*_{\nu}\to-\eta_{\mu\nu}$ for each photon.
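Spelled out as a sketch: write the amplitude as $M = M_{\mu\nu}\,\epsilon_1^{\mu*}\epsilon_2^{\nu*}$, where $M_{\mu\nu}$ is shorthand introduced here (not notation from the question) for everything except the polarization vectors, and the labels 1, 2 refer to the two photons. Then, assuming signature $(+,-,-,-)$,

$$
\sum_{\lambda_1,\lambda_2}|M|^2
= M_{\mu\nu}M^*_{\alpha\beta}\Big(\sum_{\lambda_1}\epsilon_1^{\mu*}\epsilon_1^{\alpha}\Big)\Big(\sum_{\lambda_2}\epsilon_2^{\nu*}\epsilon_2^{\beta}\Big)
\to M_{\mu\nu}M^*_{\alpha\beta}\,(-\eta^{\mu\alpha})(-\eta^{\nu\beta})
= M_{\mu\nu}M^{*\mu\nu} \,.
$$

The two minus signs cancel, one from each photon.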