The way I see it, there are four possible answers. You can pick the one you like the most, because in the end it doesn't matter very much.
You're right, it's a second order approximation and those who say it's first order are making a terminology mistake.
When we say first order, we really mean first non-null order, since the linear term vanishes.
It's actually first order in $v^2$.
It doesn't really matter. We all know what the non relativistic approximation is, its properties are not going to change if we call it by a different name.
Personally I support answer 4, and I suggest you get used to it because physics is not known for its rigor and formality.
This bothered me as an undergraduate too, and it wasn't until much later that I found out the rigorous mathematical way of understanding perturbation theory. The basic answer is that "we are restricting our attention to perturbations with this property." For the details, my answer will be loosely cribbed from Wald's General Relativity, which gives a better mathematical description of what we're really doing here than most classical mechanics texts do.
Let's suppose we want to solve a differential equation $\mathcal{E}[q_i(t)] = 0$, where $\mathcal{E}$ stands for some non-linear operator on the functions $q_i(t)$. Let us assume that there exists a family of exact solutions $q_i(t; \lambda)$ to the equations of motion, parameterized by a parameter $\lambda$, with the following properties:
- For all $\lambda$, $\mathcal{E}[q_i(t; \lambda)] = 0$;
- $q_i(t; 0) = q_{0i}(t)$, where $q_{0i}(t)$ is our "background solution"; and
- $q_i(t; \lambda)$ depends smoothly $\lambda$ and $t$.
In some sense, $\lambda$ measures the "size" of the perturbation away from the background solution. In particular, since all of our solutions are exact, we can say that
$$
\left.\frac{d}{d\lambda} \mathcal{E}[q_i(t; \lambda)] \right|_{\lambda = 0} = 0,
$$
and it is not too hard to see that this equation will be a linear equation in the functions
$$
\gamma_{i}(t) \equiv \left. \frac{d q_i(t;\lambda)}{d\lambda}\right|_{\lambda = 0}.
$$
For sufficiently small $\lambda$, the quantity $q_{0i}(t) + \lambda \gamma_i(t)$ will be a good approximation to $q_i(t;\lambda)$, allowing us to study solutions that are "close" to our background solution. The $\eta_i(t)$ used by Goldstein would be equal to $\lambda \gamma_i(t)$ in this language.
Once you have put all of this into place, then it is fairly straightforward to show that $\dot{\eta}_i(t)$ must go to zero as $\lambda \to 0$, since
$$
\frac{d \eta_i}{dt} = \frac{d}{dt} \left[ \lambda \left. \frac{d q_i(t, \lambda)}{d \lambda} \right|_{\lambda = 0}\right] = \lambda \left.\frac{d^2 q_i(t,\lambda)}{dt d\lambda} \right|_{\lambda = 0}
$$
and the smoothness assumption on the family $q_i(t;\lambda)$ ensures that this second derivative exists.
To see why the smoothness assumption is what saves us here from the pathology you're considering, consider the one-parameter family of functions
$$
f(t; \lambda) = \lambda \sin (t/\lambda).
$$
It is certainly the case that as $\lambda \to 0$, we have $\eta \to 0$ but $\dot{\eta} \not\to 0$. But this family of functions is not smooth in $\lambda$, since
$$
\frac{df}{d\lambda} = \sin(t/\lambda) - \frac{t}{\lambda}\cos(t/\lambda)
$$
and this is not well-defined at $\lambda = 0$. The assumption of a "smooth family of solutions" means that we have eliminated such pathology in the first place, though, and so we don't have to worry about such cases when we're trying to linearize our equations.
Best Answer
The difference arises because you are confusing the three spatial components of the 4-momentum, which are in fact $p^j = \gamma\,m\,v^j$ with the three components of the Newtonian momentum $m\,v^j$. If you write:
$$\frac{p^2}{m^2\,\gamma^2} = v^2$$
and solve for $v$ in terms of $p$, then substitute back into the Watkin's lecture notes, you'll get your factor of $\frac{1}{8}$ in the quartic term. That's really messy. It's much easier to begin with the equation for the momentum four-vector's square norm, which I always remember as $\frac{E^2}{c^2} - p^2= m^2\,c^2$, then write $E = T + m\,c^2$ and solve for $T$ in terms of $p$.
Effectively what you've done is to use the wrong power of $\gamma$ in your expressions as a result of what seems to be your confusion.
The right (and messy) way to derive the same expression given in the Watkin's lectures and using $T = (\gamma-1)\,m\,c^2$ is most succinctly described by the following Mathematica statement:
Series[((1/Sqrt[1 - (v/c)^2]) - 1) m c^2 /. Solve[p == m v/Sqrt[1 - (v/c)^2], v], {p, 0, 4}] // Normal
in a friendly conversation with Ms Mathematica Kernel, to which she obligingly replies:
$$\left\{\frac{p^2}{2 m}-\frac{p^4}{8 c^2 m^3},\frac{p^2}{2 m}-\frac{p^4}{8 c^2 m^3}\right\}$$