The problem with the reasoning has already been pointed out in the comments.
Let's see how to derive the formula correctly. (By the way, a similar argument is already given in Colding & Minicozzi's minimal surface book; I am just mimicking it.)
Consider the family of embeddings $F:\Sigma \times (-\epsilon, \epsilon) \to \Bbb{R}^{N}$, $(x,t)\mapsto x- tH(x)$, with image $F(\Sigma , t) = \Sigma_t$. The manifold $\Sigma _t$ is then isometric to $(\Sigma,(F(\cdot,t))^*(\bar{g}))$, where $F(\cdot, t): \Sigma\to \Bbb{R}^N$ and $\bar{g}$ is the standard Euclidean metric. The new metric on $\Sigma$ is therefore $$g_{ij}(t) = \bar{g}(dF(\partial_i),dF(\partial_j)).$$
In local coordinates we therefore have $$\int_{\Sigma_t} u(x,t) \ dV_{t} = \int_{\Sigma} u \sqrt{\det{g_{ij}(t)} \det{g^{ij}(0)}}\sqrt{\det(g_{ij}(0))}\ dx^1\wedge\dots\wedge dx^n.$$
Denote $\nu(t) = \sqrt{\det{g_{ij}(t)} \det{g^{ij}(0)}}$. Now, using the fact on page 7 of the book referenced above, we have
$$\frac{d}{dt}\nu(t) = \text{div}_{\Sigma}(F_t) = \text{div}_{\Sigma}(-H)$$
Therefore $$\frac{d}{dt}\int_{\Sigma_t}u\ dV_t = \frac{d}{dt}\int_{\Sigma} u\nu\ dV_0 = \int_{\Sigma} \left(u_t \nu + u\frac{d\nu}{dt}\right) dV_0 = \int_{\Sigma_t} u_t\ dV_t - \int_{\Sigma}|H|^2 u$$
For the global case, just use a partition of unity, which finishes the proof.
And I'm very confused: which hypersurface do they really care about, $M$ or $M_t$.
In some sense, this is really the wrong question to ask. They care about a family of Riemannian manifolds: one for each value of $t$. It is up to you how you decide to implement this. Then, for extrinsic questions, you would have to specify how these Riemannian manifolds are to ‘sit inside’ another Riemannian manifold. But let me elaborate on my view.
Since $F$ is a 1-parameter family of immersions, it is often easier to formulate things using $M$ and the corresponding Riemannian metric $g_t:=F_t^*g$, i.e. you consider the 1-parameter family of Riemannian manifolds $(M,g_t)$. So, you keep the same underlying smooth manifold, while you vary the metric tensor. Keep in mind that the ‘geometry’ (i.e. lengths and angles) is all encoded in the metric.
You mention Lee’s proposition 5.18, so let me reiterate what I already mentioned in the comments. Because you’re dealing with immersions, you had better be careful with the topology. The image set $M_t:=F_t[M]$ is not necessarily an embedded submanifold of $N$ (i.e. the slice-chart condition could fail). Almost everyone, by default, thinks of a subset (here $M_t$) of a topological space (here $N$) as carrying the induced subspace topology, so you’re going to have to do lots of mental gymnastics if you try to visualize everything using $M_t$ (of course this can help, but don’t be misled by ‘intuition’). This is why, to avoid issues, it is simpler to fix the underlying smooth manifold to be $M$, and let all the dependence be encoded in the map (namely $F$). This viewpoint is also convenient if you want to take derivatives or really do calculus of any sort. Also, if for whatever reason you don’t require the immersions to be injective, then that seems to me to be all the more reason to avoid looking at or formulating things on the image set (apart from having a vague visualization… but really our visualization corresponds to the scenario when we have a 1-parameter family of embeddings).
Having said all of this, you of course don’t need to do things this way. Suppose we have a 1-parameter family of injective immersions. The map $F_t:M\to N$ restricts to a bijection $f_t:M\to M_t$, and the domain is a smooth manifold, so by transport of structure, you can equip the target $M_t$ with a topology and smooth structure such that $f_t$ becomes a diffeomorphism. Then, the inclusion $\iota_t:M_t\to N$ becomes an injective immersion (since $\iota_t=F_t\circ f_t^{-1}$ is the composition of an injective immersion with a diffeomorphism). So, yes, you can consider the Riemannian manifold $(M_t,\iota_t^*g)$. The map $f_t$ is an isometry of the Riemannian manifold $(M,g_t)$ onto $(M_t,\iota_t^*g)$, so geometrically everything is preserved (but again, I can’t stress this enough, $M_t$ has a different topology in general than the subspace topology induced from the ambient $N$, so be very careful with ‘intuition’). In other words, every geometric statement you formulate in one way can equivalently be rephrased in the other simply by composing with, pulling back by, or pushing forward by $f_t$ (or its inverse) appropriately (this remark applies to your edit as well).
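To spell out why $f_t$ is an isometry, here is a one-line check using only the maps defined above. Since $\iota_t\circ f_t = F_t$ as maps $M\to N$, \begin{align*} f_t^*(\iota_t^*g) &= (\iota_t\circ f_t)^*g = F_t^*g = g_t. \end{align*} So the metric on $M_t$ pulls back under $f_t$ to exactly the metric $g_t$ on $M$.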
Let me now make some other remarks. An immersion is ‘locally’ an embedding (be very careful with what this does and does not mean; Lee is careful to point out the differences), so as long as you’re only dealing with local questions in topology/geometry, it doesn’t matter which you decide to take as your underlying set. You can always formulate things one way or the other (though like I said above, for calculations, it is much more convenient to fix the set, and allow the map to vary). This is also why when talking about the extrinsic curvature (second fundamental form) although one starts out generally by only considering an immersion of one Riemannian manifold into another, we often pretend it is actually an embedded submanifold (again Lee makes comments about this in his Riemannian geometry text).
Finally, if we’re in the special case where we have a 1-parameter family of embeddings, then for visualization purposes we can freely consider the image sets $M_t=F_t[M]$ and we don’t have to worry about any caveats. Stating theorems and understanding the statements is very intuitive in this setting. But again, apart from visualization, for concrete computations, it’s better to fix the set and transfer everything into the map.
Honestly, I don't know what a volume measure is.
See the comment for your previous question about this. Again, you can pose the same question as to where the measure $\mu_t$ is defined, either $M$ or $M_t$. Well, by means of the isometry $f_t:M\to M_t$, you can go back and forth between any formulation. Though, once again the same remarks apply regarding visualization vs computation.
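As a sketch of how this back-and-forth works for integrals, write $\tilde\mu_t$ for the Riemannian measure of $(M_t,\iota_t^*g)$; this symbol is my own notation, introduced only for illustration, and it assumes the injective-immersion setting in which $f_t$ is defined. Because $f_t$ is an isometry, for any integrable function $u$ on $M_t$ one has \begin{align*} \int_{M_t} u\,d\tilde\mu_t &= \int_M (u\circ f_t)\,d\mu_t. \end{align*} So an integral over the moving set $M_t$ can always be rewritten as an integral over the fixed manifold $M$ against the time-dependent measure $\mu_t$.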
Best Answer
The following formula is well known and probably found in most expositions of minimal (hyper)surfaces: Let $M$ be a manifold with Riemannian metric $g_M$ and $\Sigma$ be an $(n-1)$-dimensional manifold. Let $\Phi_t: \Sigma \rightarrow M$ be a smooth family of embeddings and $g_t = \Phi_t^*g_M$ be the family of induced Riemannian metrics such that at $t=0$, $$ \partial_t\Phi = \phi \nu, $$ where $\phi$ is a smooth function on $\Sigma$ and $\nu$ is the unit normal to $\Sigma_0$ in the direction of $\partial_t\Phi$. Then the volume measure $d\mu_t$ of $g_t$ satisfies $$ \partial_td\mu_t = \phi H d\mu_t. $$
Here is a calculation using local coordinates $(x^1, \dots, x^{n-1})$ on $\Sigma$ that shows this. For $t \ge 0$, let \begin{align*} g_{ij} &= g_M(\partial_i\Phi_t,\partial_j\Phi_t). \end{align*} It follows that, at $t=0$ (where $\partial_t\Phi=\phi\nu$), \begin{align*} \partial_tg_{ij} &= g_M(\nabla_t\partial_i\Phi,\partial_j\Phi) + g_M(\partial_i\Phi,\nabla_t\partial_j\Phi)\\ &= g_M(\nabla_i\partial_t\Phi,\partial_j\Phi) + g_M(\partial_i\Phi,\nabla_j\partial_t\Phi)\\ &= g_M(\nabla_i(\phi\nu),\partial_j\Phi) + g_M(\partial_i\Phi,\nabla_j(\phi\nu))\\ &= \phi(g_M(\nabla_i\nu,\partial_j\Phi)+g_M(\partial_i\Phi,\nabla_j\nu))\\ &= 2\phi A_{ij}, \end{align*} where $A_{ij}\,dx^i\,dx^j$ is the second fundamental form. The fourth equality holds because $\nu$ is orthogonal to $\partial_j\Phi$, so the terms containing $\partial_i\phi$ drop out. On the other hand, \begin{align*} \partial_td\mu_t &= \partial_t(\sqrt{\det g})\,dx. \end{align*} The formula now follows from the standard formula for the derivative of the determinant of a matrix-valued function $G(t)$ (when the determinant is positive): \begin{align*} \partial_t(\log\det G)&= \operatorname{trace}(G^{-1}\partial_tG). \end{align*}
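For completeness, here is that last step written out, as a sketch in the notation above. It assumes the convention $H = g^{ij}A_{ij}$ for the mean curvature, which is the one that matches the stated formula. Applying the determinant formula to the matrix $(g_{ij})$ gives \begin{align*} \partial_t\sqrt{\det g} &= \tfrac12\sqrt{\det g}\;\partial_t\log\det g\\ &= \tfrac12\sqrt{\det g}\;g^{ij}\partial_tg_{ij}\\ &= \phi\,g^{ij}A_{ij}\sqrt{\det g}\\ &= \phi H\sqrt{\det g}, \end{align*} and multiplying by $dx$ yields $\partial_td\mu_t = \phi H\,d\mu_t$.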
A key trick here is to pull everything back to the fixed manifold $\Sigma$ and view everything as time-dependent functions on $\Sigma$. Viewing them as functions on $\Sigma_t$ is more confusing for me, so I avoid it.
You'll have to verify carefully that each step is valid. If you haven't done calculations like this in the past, you might want to use local coordinates on $M$ as well. There are slicker ways to justify the calculation above, but I always recommend working it out in local coordinates first.
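If you want a quick sanity check before doing the calculation by hand, the formula $\partial_td\mu_t = \phi H\,d\mu_t$ at $t=0$ can be tested numerically in the simplest case. The sketch below is my own illustration, not part of the argument above. It takes $\Sigma$ to be a circle of radius `R` in the plane, with outward unit normal and $H = 1/R$ under the convention $H = g^{ij}A_{ij}$, and uses the variation $\Phi_t(\theta) = (R + t\phi(\theta))(\cos\theta,\sin\theta)$. The names `R`, `phi`, `dphi`, `density`, `lhs` and `rhs`, and the particular choice of $\phi$, are all assumptions made for this example.

```python
import math

R = 2.0  # radius of the initial circle (arbitrary choice for this example)

def phi(theta):
    """Normal speed of the variation at t = 0 (arbitrary smooth choice)."""
    return 1.0 + 0.5 * math.cos(3.0 * theta)

def dphi(theta):
    """Derivative of phi with respect to theta."""
    return -1.5 * math.sin(3.0 * theta)

def density(theta, t):
    """sqrt(det g_t) in the coordinate theta for
    Phi_t(theta) = (R + t*phi(theta)) * (cos theta, sin theta),
    i.e. the length of d(Phi_t)/d(theta)."""
    rho = R + t * phi(theta)
    return math.hypot(rho, t * dphi(theta))

def lhs(theta, h=1e-5):
    """Central finite difference for d/dt sqrt(det g_t) at t = 0."""
    return (density(theta, h) - density(theta, -h)) / (2.0 * h)

def rhs(theta):
    """phi * H * sqrt(det g_0), with H = 1/R for the circle of radius R."""
    H = 1.0 / R
    return phi(theta) * H * density(theta, 0.0)

# Compare both sides at 50 equally spaced sample points on the circle.
samples = [2.0 * math.pi * k / 50 for k in range(50)]
max_error = max(abs(lhs(th) - rhs(th)) for th in samples)
print("max |lhs - rhs| over samples:", max_error)
```

The two sides should agree up to finite-difference error. Note that for $t\neq 0$ the velocity of this variation is no longer normal to $\Sigma_t$, so the check is only meaningful at $t=0$, which is where the formula is stated.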