The elements of the gauge transformations belong to a gauge group. In physics, it's most typically $SU(N)$ (both the electroweak theory, with its $SU(2)$, and QCD for quarks, with its $SU(3)$, use these $SU(N)$ groups; the $U(1)$ we first learn about in electromagnetism, whose charge must be reinterpreted as the "hypercharge" when we study the electroweak theory, is the only extra addition we need for the Standard Model). It's the group of all complex $N\times N$ matrices $M$ that obey
$$MM^\dagger=1, \quad \det M = 1$$
Note that $M^\dagger=(M^*)^T$ is the Hermitian conjugate; the first condition makes the matrix "unitary", therefore $U$. The determinant of a unitary matrix could be any complex number whose absolute value equals one. The second condition says that the determinant must be one and nothing else, that's the "special" or $S$ condition in $SU(N)$.
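As a small numerical illustration of the two defining conditions, here is a sketch in plain Python for the $2\times 2$ case. The parametrization of the $SU(2)$ element, the helper names (`dagger`, `matmul`, `det2`, `is_unitary`) and the particular angles are my own choices for the example, not anything canonical.

```python
import cmath
import math
from typing import List

Matrix = List[List[complex]]

def dagger(m: Matrix) -> Matrix:
    """Hermitian conjugate: transpose and complex-conjugate every entry."""
    n = len(m)
    return [[m[j][i].conjugate() for j in range(n)] for i in range(n)]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square matrices of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def det2(m: Matrix) -> complex:
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def is_unitary(m: Matrix, tol: float = 1e-12) -> bool:
    """Check M M^dagger = 1 entry by entry, up to rounding errors."""
    n = len(m)
    p = matmul(m, dagger(m))
    return all(abs(p[i][j] - (1 if i == j else 0)) < tol
               for i in range(n) for j in range(n))

# An SU(2) element of the form [[a, b], [-b*, a*]] with |a|^2 + |b|^2 = 1
# (the angles 0.3, -1.1 and 0.7 are arbitrary illustration values).
a = cmath.exp(0.3j) * math.cos(0.7)
b = cmath.exp(-1.1j) * math.sin(0.7)
su2 = [[a, b], [-b.conjugate(), a.conjugate()]]

# Multiplying by an overall phase keeps the matrix unitary,
# but the determinant becomes phase^2, so it is in U(2) and not in SU(2).
phase = cmath.exp(0.5j)
u2 = [[phase * x for x in row] for row in su2]

print("su2 unitary:", is_unitary(su2), " det:", det2(su2))
print("u2  unitary:", is_unitary(u2), " det:", det2(u2))
```

The second matrix shows why the "special" condition is an independent requirement: its determinant has absolute value one but is not equal to one.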
The gauge field transforms as
$$ A_\mu \to M(A_\mu+ie\partial_\mu) M^\dagger$$
up to different conventions. That's needed for the covariant derivative $D_\mu$ to transform nicely. Forget about the complicated formula above. The point is that $A_\mu$ takes values in the Lie algebra of the Lie group.
In other words, you may imagine an infinitesimal transformation – infinitely close to the identity – in the gauge group, e.g. $SU(N)$. Assume
$$ M = 1+i\epsilon G $$
The factor $\epsilon$ makes it infinitesimal; the factor of $i$ is a convention popular among physicists but omitted by mathematicians (physicists like the generators to be Hermitian; without the $i$, they would have to be anti-Hermitian).
Here, $G$ is the kind of $N\times N$ matrix that the gauge field can have as a value.
Now, substitute this Ansatz for $M$ into the conditions $MM^\dagger=1,\det M=1$. You may neglect $\epsilon^2$ "very small" terms and the conditions become
$$1+i\epsilon G - i\epsilon G^\dagger = 1, \quad \det(1+i\epsilon G) = 1$$
Mathematics implies that these conditions are equivalent to
$$ G = G^\dagger, \quad {\rm Tr}(G) =0 .$$
To get the first one, I just subtracted $1$ from both sides and cancelled $i\epsilon$. To get the latter, I used the "sum of products over permutations" formula for the determinant and noticed that only the product of the diagonal entries contributes $O(\epsilon)$ terms and they're proportional to the sum of the diagonal entries, the trace.
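For readers who want the intermediate steps written out, here is a sketch of the two expansions to first order in $\epsilon$. The notation $G_{kk}$ for the diagonal entries is mine. In the determinant, any permutation other than the identity moves at least two indices, so it brings at least two off-diagonal factors, each of order $\epsilon$, and only contributes at $O(\epsilon^2)$.

$$ MM^\dagger = (1+i\epsilon G)(1-i\epsilon G^\dagger) = 1 + i\epsilon\,(G - G^\dagger) + \epsilon^2\, GG^\dagger $$

$$ \det(1+i\epsilon G) = \prod_{k=1}^{N}(1+i\epsilon G_{kk}) + O(\epsilon^2) = 1 + i\epsilon \sum_{k=1}^{N} G_{kk} + O(\epsilon^2) = 1 + i\epsilon\,{\rm Tr}(G) + O(\epsilon^2) $$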
At any rate, you should try to understand this maths. Its conclusion is that the Hermiticity of the generators $G$ – matrices that are combined with various real coefficients to get $A_\mu$ – is equivalent to the gauge group's being unitary; and the tracelessness is equivalent to the group's being "special", i.e. requiring the unit determinant.
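If you prefer to see numbers, the following sketch checks the claim for $2\times 2$ matrices. It builds $M = 1 + i\epsilon G$ and measures how far $MM^\dagger$ and $\det M$ are from $1$. The two sample matrices and the function names are hypothetical choices made only for this illustration.

```python
from typing import List, Tuple

Matrix = List[List[complex]]

def infinitesimal(G: Matrix, eps: float) -> Matrix:
    """Return M = 1 + i*eps*G for a 2x2 matrix G."""
    return [[(1 if i == j else 0) + 1j * eps * G[i][j] for j in range(2)]
            for i in range(2)]

def defects(G: Matrix, eps: float) -> Tuple[float, float]:
    """Return (max |(M M^dagger - 1)_ij|, |det M - 1|) for M = 1 + i*eps*G."""
    M = infinitesimal(G, eps)
    Mdag = [[M[j][i].conjugate() for j in range(2)] for i in range(2)]
    prod = [[sum(M[i][k] * Mdag[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
    unitarity = max(abs(prod[i][j] - (1 if i == j else 0))
                    for i in range(2) for j in range(2))
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return unitarity, abs(det - 1)

# Hermitian and traceless: an allowed SU(2) generator (illustrative values).
G_traceless = [[1 + 0j, 2 - 1j], [2 + 1j, -1 + 0j]]
# Hermitian but with trace 2: fine for U(2), not for SU(2).
G_with_trace = [[2 + 0j, 2 - 1j], [2 + 1j, 0 + 0j]]

for eps in (1e-2, 1e-3):
    print(eps, defects(G_traceless, eps), defects(G_with_trace, eps))
```

For the traceless generator both defects shrink like $\epsilon^2$, so they are invisible at first order. For the generator with a nonzero trace, the determinant defect shrinks only like $\epsilon$, which is exactly the ${\rm Tr}(G)$ term.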
It's perhaps useful to mention why $SU(N)$ is considered the "simplest" class of gauge groups. The $S$ has to be there because $U(N)$ isn't simple – it's pretty much isomorphic to $SU(N)\times U(1)$ where the two factors could be treated separately and we want to work with the smallest allowed pieces of gauge groups which are $SU(N)$ and $U(1)$. And $SU(N)$ is more "elementary" than $SO(N)$ or $USp(2N)$ because complex numbers are more fundamental in group theory (and physics) than real numbers or quaternions. In fact, the groups $SO(N)$ and $USp(2N)$ may be defined as $SU(N)$ with some "extra structure" (orientifolds) added which makes some natural group-theoretical analyses somewhat more convoluted than those for $SU(N)$. But one may still say that the Lie algebra for $SO(N)$ would be composed of antisymmetric real (or antisymmetric pure imaginary, depending on the conventions concerning factors of $i$) matrices, in analogy with the Hermitian matrices above; they're automatically traceless.
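The number of independent generators, i.e. the number of real coefficients needed to specify $A_\mu$ at a point for each spacetime index, follows from counting the real parameters of the matrices just described. The sketch below uses the standard dimension formulas; the function names are mine.

```python
def dim_u(n: int) -> int:
    """Hermitian n x n matrices: n real diagonal entries plus
    n(n-1)/2 complex off-diagonal ones, i.e. n^2 real parameters."""
    return n * n

def dim_su(n: int) -> int:
    """Tracelessness removes one real parameter from the Hermitian count."""
    return n * n - 1

def dim_so(n: int) -> int:
    """Antisymmetric real n x n matrices: one parameter per pair i < j."""
    return n * (n - 1) // 2

def dim_usp(n: int) -> int:
    """Dimension of USp(2n), using the standard formula n(2n+1)."""
    return n * (2 * n + 1)

for n in (2, 3, 5):
    print(n, dim_u(n), dim_su(n), dim_so(n), dim_usp(n))
```

For example, `dim_su(2)` gives the three gauge bosons of $SU(2)$ and `dim_su(3)` gives the eight gluons, while `dim_u(n) - dim_su(n)` is always one, the single generator of the extra $U(1)$.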
You really should split your question. I will answer the part where you do not understand how the counting of degrees of freedom works.
Basically we count the number of propagating (physical) degrees of freedom per point of spacetime. Of course, the total number of degrees of freedom is infinite because spacetime is continuous and has an infinite number of points, but to ask for the number of degrees of freedom per spacetime point is a reasonable demand to make. Bear in mind that we only care about physical degrees of freedom by which we mean those that can be properly normalized.
You correctly state that photons can be off-shell, but only those involved in internal processes are. External photons are always on-shell. Moreover, gauge invariance is a physical property. External fields which you measure in your laboratory should be independent of your chosen gauge. In other words, the S-matrix should be gauge-invariant. On the other hand, there is nothing that stops me from having gauge-broken internal processes if ultimately I can make the S-matrix gauge-invariant. Therefore, the word "physical" should almost always give you a picture of external on-shell gauge-invariant quantities.
So yes, gauge redundancy kills one degree of freedom, and when we are talking about propagating physical degrees of freedom, one more is killed on-shell. You have to understand how that happens. It is not that every time you see an equation of motion, a degree of freedom is killed. Killing of degrees of freedom requires an elaborate process of imposing constraints on the equation of motion known as gauge-fixing. And this has to be done on a case by case basis.
For example, consider the four equations of motion (separated into temporal and spatial sets) for the massless photon $A^\mu = (\phi, \vec A)$ describing four on-shell degrees of freedom as follows.
\begin{align*}
-\Delta \phi + \partial_t \vec\nabla\cdot\vec A &= 0\,,\\
\square \vec A - \vec\nabla(\partial_t\phi-\vec\nabla\cdot\vec A) &= 0\,.
\end{align*}
Since these equations exhibit a gauge symmetry $A_\mu \to A'_\mu := A_\mu + \partial_\mu \alpha_1(x)$, we can try to fix the gauge by choosing $\alpha_1$ such that, for instance, it is a solution of the Poisson equation $\Delta \alpha_1 = -\vec\nabla\cdot\vec A$, giving us
\begin{align*}
\Delta \phi' &= 0\,, \\
\square \vec A' - \vec\nabla\partial_t\phi' &= 0\,, \\
\vec\nabla\cdot\vec{A}' &= 0\,.
\end{align*}
We have selected a divergence-free field, the so-called Coulomb gauge. Under this choice, the electric potential becomes non-propagating, that is there are no kinetic terms in the Lagrangian for it (observe that $\Delta \phi' = 0$ does not have any time derivatives).
In momentum space, this gauge condition reads $\vec p \cdot \vec \epsilon = 0$ where $ \vec \epsilon$ is the polarisation vector (Fourier transform of the magnetic potential). There are three solutions to this constraint. Choosing a frame in which $p^\mu = (E,0,0,E)$, we find that the three polarisation vectors are
$$ \epsilon^\mu_1 = (0,1,0,0), \qquad \epsilon_2^\mu=(0,0,1,0), \qquad \epsilon_t^\mu = (1,0,0,0) $$
The third polarisation is time-like and therefore cannot be normalized. It is unphysical, and we have to get rid of it. Luckily, the gauge symmetry is not exhausted. There are more available choices of gauge transformations which preserve the Coulomb gauge $\vec p \cdot \vec \epsilon = 0$. For example, we could go from $A'_\mu \to A_\mu:= A'_\mu + \partial_\mu \alpha_2(x)$ such that $\Delta \alpha_2 = 0,\ \partial_t \alpha_2 = - \phi'$ which preserves the divergence and sets $\phi = 0$.
Note that this time we have to make sure that this gauge transformation happens on-shell, namely that $\Delta \phi' = 0$. Otherwise this gauge-fixing will be inconsistent: $\Delta \alpha_2 = 0$ implies $0 = \Delta \partial_t\alpha_2 = - \Delta\phi'$, whereas $\Delta\phi' \ne 0$ off-shell. In other words, requiring $\phi = 0$, or equivalently $\epsilon^0 = 0$, in order to get rid of unphysical degrees of freedom requires us to be on-shell.
To summarize, we made an off-shell gauge choice $\vec p \cdot \vec \epsilon = 0$, an on-shell gauge choice $\epsilon^0 = 0$ and our equation of motion became $p^2 = 0$. Having exhausted our gauge choices, we find only two physical polarization modes or degrees of freedom.
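The bookkeeping in this summary can be mimicked with a few lines of Python. This is only a sketch: the mostly-minus signature $(+,-,-,-)$, the extra longitudinal candidate `eps_3` and all the names are my assumptions, added so that each condition has something to remove.

```python
from typing import Dict, Tuple

Vector = Tuple[float, float, float, float]

ETA = (1, -1, -1, -1)  # assumed metric signature (+,-,-,-)

def minkowski(a: Vector, b: Vector) -> float:
    """Lorentz-invariant product a.b with the assumed signature."""
    return sum(g * x * y for g, x, y in zip(ETA, a, b))

def spatial_dot(a: Vector, b: Vector) -> float:
    """Ordinary dot product of the spatial parts."""
    return sum(x * y for x, y in zip(a[1:], b[1:]))

E = 1.0
p: Vector = (E, 0.0, 0.0, E)  # massless momentum along the third axis

candidates: Dict[str, Vector] = {
    "eps_1": (0.0, 1.0, 0.0, 0.0),
    "eps_2": (0.0, 0.0, 1.0, 0.0),
    "eps_3": (0.0, 0.0, 0.0, 1.0),  # longitudinal candidate, added for illustration
    "eps_t": (1.0, 0.0, 0.0, 0.0),
}

# Off-shell gauge choice (Coulomb gauge): p_vec . eps_vec = 0
coulomb = {k: v for k, v in candidates.items() if spatial_dot(p, v) == 0}

# On-shell gauge choice: eps^0 = 0
physical = {k: v for k, v in coulomb.items() if v[0] == 0}

print("p^2 =", minkowski(p, p))
print("after Coulomb gauge:", sorted(coulomb))
print("after eps^0 = 0:    ", sorted(physical))
print("norms:", {k: minkowski(v, v) for k, v in coulomb.items()})
```

The Coulomb condition removes the longitudinal candidate and leaves the three vectors listed above; the norm of `eps_t` has the opposite sign from the two transverse ones, and the second condition removes it.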
Now, you understand that merely having an equation of motion does not eat up a degree of freedom. To find the correct number of degrees of freedom, keep on making gauge choices (producing independent constraint equations), some off-shell and some on-shell, until you exhaust your gauge freedom. Then check how many degrees of freedom you are left with. If you notice any unphysical guy showing up, most likely you haven't used up all your gauge freedom and you still have enough flex to shoot this guy dead. Then, count all that you are left with. That's your answer.
Best Answer
First of all, Sean Carroll is a relativist so his treatment of the diffeomorphism symmetry as a gauge symmetry should be applauded because it's the standard modern view preferred by particle physicists – its origin is linked to names such as Steven Weinberg, it is promoted by physicists like Nima Arkani-Hamed, and it is naturally incorporated in string theory, so it is seen as "obvious" by all string theorists. In this sense, Carroll throws away the obsolete "culture" of the relativists. There are some other "relativists" who irrationally whine that it shouldn't be allowed to call the metric tensor "just another gauge field" and the diffeomorphism group "just another gauge symmetry" even though this is exactly what these concepts are.
Second of all, a symmetry expressed by a Lie algebra can't be "discrete", by definition: it is continuous. Lie groups are continuous groups; it is their definition. And only continuous groups are able to make whole polarizations of particles unphysical. It's plausible that a popular book replaces the continuous groups by discrete ones that are easier to imagine by the laymen but this server is not supposed to be "popular" in this sense.
Third, your statement that if $U$ is unitary, the generator has to be Hermitian and traceless is partly wrong. Unitarity of $U$ means the Hermiticity of the generators $T^a$, but the tracelessness of these generators is a different condition, namely the property that $U$ is "special" (having the determinant equal to one). The tracelessness is what reduces $U(N)$ to $SU(N)$, unitary to special unitary.
Fourth, and it is related to the second point above, "charge conjugation" isn't any gauge principle of electromagnetism in any way. Electromagnetism is based on the continuous $U(1)$ gauge group. This group has an outer automorphism – its group of outer automorphisms is ${\mathbb Z}_2$ – but we never put these elements of the discrete group into an exponent.
Fifth, similarly, QCD isn't based on the discrete symmetry of permutations of the colors but on the continuous $SU(3)$ group of special unitary transformations of the 3-dimensional space of colors. Because none of the things you wrote about the non-gravitational case was quite right, it shouldn't be surprising that you have to encounter lots of apparent contradictions in the case of gravity as well because gravity is indeed more difficult in some sense.
Sixth, $SO(3,1)$ isn't related to the diffeomorphisms in any direct way. It is surely not the same thing. This group is the Lorentz group and in GR, you may choose a formalism based on tetrads/vielbeins/vierbeins where it becomes a local symmetry because the orientation of the tetrad may be rotated by a Lorentz transformation independently at each point of the space. But this is just an extra gauge symmetry that one must add if one works with tetrads – it's a symmetry that exists on top of the diffeomorphism symmetry, and the diffeomorphism symmetry is different and "non-local" because it changes the spacetime coordinates of objects or fields, while all the Yang-Mills symmetries above and even the local Lorentz group at the beginning of this paragraph are acting locally, inside the field space associated with a fixed point of the spacetime. (The fact that diffeomorphisms in no way "boil down" to the local Lorentz group is a rudimentary insight that is misunderstood by all the people who talk about the "graviweak unification" and similar physically flawed projects.) I will not work with tetrads in the next paragraph so the gauge symmetry will be just diffeomorphisms and there won't be any local Lorentz group as a part of the gauge symmetry.
The diffeomorphism symmetry is locally generated by the translations, not Lorentz transformations, and the parameters of these 4-translations depend on the position in the 4-dimensional spacetime. This is how a general infinitesimal diffeomorphism may be written down. If there were no gauge symmetries, $g_{\mu\nu}$ would have 10 off-shell degrees of freedom, like 10 scalar fields. However, each generator makes two polarizations unphysical, just like in the case of QED or QCD above (where the 4 polarizations of a vector were reduced down to 2; in QCD, all these numbers were multiplied by 8, the dimension of the adjoint representation of the gauge group, $SU(3)$ etc.). Because the general translation per point has 4 parameters, one removes $2\times 4 = 8$ polarizations and is left with $10-8=2$ physical polarizations of the gravitational wave (or graviton). The usual bases chosen in this 2-dimensional physical space are the right-handed circular plus left-handed circular polarized waves; or the "linear" polarizations that stretch and shrink the space in the horizontal/vertical direction plus the wave doing the same in directions rotated by 45 degrees.
This counting was actually a bit of a cheat but it does work in the general dimension. To do the counting properly and controllably, one has to distinguish constraints from dynamical equations and see how many of the modes of a plane wave (gravitational wave) are affected by a diffeomorphism. In the general dimension $d$, it may be seen that the tensor $\Delta g_{\mu\nu}$ may be described, after making the right diffeomorphism, by $h_{ij}$ in $d-2$ dimensions and moreover the trace $h_{ii}$ may be set to zero. This gives us $(d-2)(d-1)/2-1$ physical polarizations of the graviton. In $d=4$, this yields 2 physical polarizations of the graviton. A gravitational wave moving in the 3rd direction is described by $h_{11}=-h_{22}$ and $h_{12}=h_{21}$ while other components of $h_{\mu\nu}$ may be either made to vanish by a gauge transformation (diffeomorphism), or they're required to vanish by the equations of motion or constraints linked to the same diffeomorphism. Morally speaking, it is true that we eliminate two groups of $d$ degrees of freedom (4 in four dimensions), as I indicated in the sloppy calculation that happened to lead to the right result. Note that
$$\frac{d(d+1)}2 -2d = \frac{(d-2)(d-1)}2-1 $$
I have to emphasize that this is the standard counting of "linearized gravity" and it's the same procedure as the counting of physical polarizations after the diffeomorphism "gauge symmetry" – just the language involving "gauge symmetries" is more particle-physics-oriented.
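The identity between the "sloppy" and the transverse-traceless counting is easy to check for a range of dimensions. The sketch below does that and also reproduces the vector-boson numbers quoted earlier. The function names are mine, and extending the vector counting to general $d$ as $d-2$ per generator is my assumption; the text above only states it for $d=4$.

```python
def symmetric_components(d: int) -> int:
    """Independent components of a symmetric d x d tensor g_{mu nu}."""
    return d * (d + 1) // 2

def naive_graviton_count(d: int) -> int:
    """Sloppy counting: each of the d diffeomorphism parameters
    removes two polarizations."""
    return symmetric_components(d) - 2 * d

def transverse_traceless_count(d: int) -> int:
    """Symmetric h_ij in d-2 transverse dimensions, minus the trace."""
    return (d - 2) * (d - 1) // 2 - 1

def vector_boson_count(generators: int = 1, d: int = 4) -> int:
    """Each generator has d components and loses two of them
    (general d is an assumption, see the lead-in)."""
    return generators * (d - 2)

# The two graviton countings agree in every dimension checked here.
for d in range(3, 12):
    assert naive_graviton_count(d) == transverse_traceless_count(d)
    print(d, naive_graviton_count(d))

print("photon:", vector_boson_count(1), " gluons:", vector_boson_count(8))
```

In $d=4$ both graviton functions return 2, and the vector counting returns 2 for the photon and $8\times 2 = 16$ for the gluons.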