The term "Yang-Mills theory" in the mass gap problem refers to a particular QFT. It is believed that this QFT (meaning its Hilbert space of states and its observable operators) should be defined in terms of a measure on the space of connections on $\mathbb{R}^4$; roughly speaking, the moments of this measure are the matrix elements of the operator-valued distributions. (People also use the term "gauge theory" to refer to any QFT, like QCD, which has a Yang-Mills sub-theory.)
The mass gap problem really has two aspects: First, one has to construct an appropriate measure $d\mu$ on some space of connections. Then, one has to work out which functions on the space of connections are integrable with respect to this measure, and show that the corresponding collection of operators includes an energy operator (i.e. a generator of time translations) which has a gap in its spectrum.
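Schematically (a gloss, not a precise statement): the moments in question are integrals of the form
$$\int_{\mathcal{F}} A(f_1)\cdots A(f_n)\,d\mu(A)$$
(modulo the gauge-invariance caveats discussed below), and "gap" means that the energy operator $H$ built from these data has spectrum contained in $\{0\}\cup[\Delta,\infty)$ for some $\Delta>0$.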
You'll have to read the literature to really learn anything about this stuff, but I can make a few points to help you on your way. Be warned that what follows is a caricature. (Hopefully, a helpful one for people trying to learn this stuff.)
About the Measure:
First, the measure isn't really defined on the space of connections. Rather, it should be defined on the space $\mathcal{F}$ of continuous linear functionals on the space $\mathcal{S}$ of smooth rapidly-vanishing $\mathfrak{g}$-valued vector fields on $\mathbb{R}^4$, where $\mathfrak{g}$ is the Lie algebra of the gauge group $G$. The space $\mathcal{F}$ contains the space of connections, since any connection on $\mathbb{R}^4$ can be written as $d$ plus a $\mathfrak{g}$-valued $1$-form and paired with a vector field via the Killing form, but it also has lots of more "distributional" elements.
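For example (a schematic formula, just unpacking the sentence above): a smooth connection $d + A$, with $A = A_i\,dx^i$ a $\mathfrak{g}$-valued $1$-form, determines the functional
$$f \;\longmapsto\; A(f) = \int_{\mathbb{R}^4} \kappa\big(A_i(x), f^i(x)\big)\, d^4x,$$
where $f = f^i\,\partial_i$ is a smooth rapidly-vanishing $\mathfrak{g}$-valued vector field and $\kappa$ is the Killing form. The "distributional" elements of $\mathcal{F}$ are the functionals which don't arise this way.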
We're supposed to get $d\mu$ as the "infinite-volume/continuum limit" of a collection of regularized measures. This means that we are going to write $\mathcal{S}$ as an increasing union of chosen finite-dimensional vector spaces $\mathcal{S}(V,\epsilon)$; these spaces are spanned by chosen vector fields which have support in some finite-volume region $V \subset \mathbb{R}^4$ and which are slowly varying on distance scales smaller than $\epsilon$. (You should imagine we're choosing a wavelet basis for $\mathcal{S}$.) Then we're going to construct a measure $d\mu_\hbar(V,\epsilon)$ on the finite-dimensional dual vector space $\mathcal{F}(V,\epsilon) = \mathcal{S}(V,\epsilon)' \subset \mathcal{F}$; these measures will have the form $\frac{1}{\mathcal{Z}(V,\epsilon,\hbar)} e^{-\frac{1}{\hbar}S_{V,\epsilon,\hbar}(A)}dA$. Here, $dA$ is Lebesgue measure on $\mathcal{F}(V,\epsilon)$, and $S_{V,\epsilon,\hbar}$ is some discretization of the Yang-Mills action, adapted to the subspace $\mathcal{F}(V,\epsilon)$. (Usually, people use some version of Wilson's lattice action.)
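To make "some version of Wilson's lattice action" a bit more concrete, here is a minimal numerical sketch (my own illustration, not part of the construction above): it evaluates the plaquette action $\beta\sum_p \mathrm{Re}\,\mathrm{Tr}(\mathbf{1}-U_p)$, where $U_p$ is the ordered product of group-valued link variables around an elementary square, for random $SU(2)$ links on a tiny periodic lattice. The lattice size, gauge group, coupling, and random initialization are all hypothetical choices.

```python
import numpy as np

L = 4        # sites per direction of a tiny periodic 4d lattice (hypothetical)
beta = 2.0   # lattice coupling, playing the role of 1/hbar in the weight e^{-S/hbar}

def random_su2():
    """Random SU(2) matrix a0*I + i*(a1,a2,a3).sigma with a uniform on S^3."""
    a = np.random.normal(size=4)
    a /= np.linalg.norm(a)
    return np.array([[a[0] + 1j * a[3],  a[2] + 1j * a[1]],
                     [-a[2] + 1j * a[1], a[0] - 1j * a[3]]])

# One SU(2) link variable U_mu(x) per lattice site x and direction mu.
links = np.empty((L, L, L, L, 4, 2, 2), dtype=complex)
for idx in np.ndindex(L, L, L, L, 4):
    links[idx] = random_su2()

def plaquette(x, mu, nu):
    """Ordered product U_mu(x) U_nu(x+mu) U_mu(x+nu)^dag U_nu(x)^dag."""
    x = np.asarray(x)
    x_mu = tuple((x + np.eye(4, dtype=int)[mu]) % L)  # neighbor in direction mu
    x_nu = tuple((x + np.eye(4, dtype=int)[nu]) % L)  # neighbor in direction nu
    x = tuple(x)
    return (links[x][mu] @ links[x_mu][nu]
            @ links[x_nu][mu].conj().T @ links[x][nu].conj().T)

def wilson_action():
    """S[U] = beta * sum over plaquettes of Re Tr(1 - U_p)."""
    S = 0.0
    for x in np.ndindex(L, L, L, L):
        for mu in range(4):
            for nu in range(mu + 1, 4):
                S += beta * (2.0 - np.trace(plaquette(x, mu, nu)).real)
    return S

print("Wilson action of a random configuration:", wilson_action())
```

In the actual construction one would integrate over all link configurations with weight proportional to $e^{-S}$; the sketch only shows what a discretized action looks like.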
Existence of the Yang-Mills measure means that one can choose $S_{V,\epsilon,\hbar}$ as a function of $V$, $\epsilon$, and $\hbar$ so that the limit $d\mu_\hbar$ exists as a measure on $\mathcal{F}$ as $\mathrm{vol}(V) \to \infty$ and $\epsilon \to 0$. We also demand that the $\hbar \to 0$ limit of $d\mu_\hbar$ is supported on the space of solutions of the classical Yang-Mills equations. (We want to tune the discretized actions to fix the classical limit.)
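Schematically (again a gloss on the statement above, not a precise definition), existence means that for a suitable class of functions $F$ on $\mathcal{F}$,
$$\int_{\mathcal{F}} F\, d\mu_\hbar \;=\; \lim_{\substack{\mathrm{vol}(V)\to\infty \\ \epsilon\to 0}}\; \frac{1}{\mathcal{Z}(V,\epsilon,\hbar)} \int_{\mathcal{F}(V,\epsilon)} F(A)\, e^{-\frac{1}{\hbar}S_{V,\epsilon,\hbar}(A)}\, dA,$$
with the limit independent of the choices (basis, discretized action) made along the way.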
About the Integrable Functions:
Generally speaking, the functions we'd like to integrate should be expressed in terms of the "coordinate functions" which map $A$ to $A(f)$, where $f$ is one of the basis elements we used to define the subspaces $\mathcal{S}(V,\epsilon)$. You should imagine that $f$ is a bump vector field, supported near $x \in \mathbb{R}^4$, so that these functions approximate the map sending a $\mathfrak{g}$-valued $1$-form to the value $A_{i,a}(x)$ of its $(i,a)$-th component.
There are three warnings to keep in mind:
First, we only want to look at functions on $\mathcal{F}$ which are invariant under the group of gauge transformations. So the coordinate functions themselves are not OK. But gauge-invariant combinations, like the trace of the curvature at a point, or the holonomy of a connection around a loop, are OK.
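The holonomy observables, for instance, are (schematically) the Wilson loops
$$W_\gamma(A) \;=\; \mathrm{Tr}\, \mathcal{P}\exp\oint_\gamma A,$$
where $\mathcal{P}\exp$ is the path-ordered exponential around a loop $\gamma \subset \mathbb{R}^4$; the trace makes this invariant under gauge transformations, which act on the holonomy by conjugation.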
Second, when expressing observables in terms of the coordinate functions, you have to be careful, because the naive classical expressions don't always carry over. The expectation value of the function $A \mapsto A_{i,a}(x)A_{j,b}(y)$ with respect to $d\mu_\hbar$ (for $\hbar \neq 0$) is going to be singular as $x \to y$. This is OK, because we were expecting these moments to define the matrix elements of operator-valued distributions. But it means we have to be careful when considering the expectation values of functions like $A \mapsto A(x)^2$. Some modifications may be required to obtain well-defined quantities. (The simplest example is normal-ordering, which you can see in many two-dimensional QFTs.)
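The free scalar field gives the simplest version of this: the normal-ordered square is (schematically)
$$:\!\phi^2\!:(x) \;=\; \lim_{y\to x}\Big(\phi(x)\phi(y) - \langle \phi(x)\phi(y)\rangle\Big),$$
i.e., one subtracts the divergent two-point function before bringing the points together. The Yang-Mills analogues are more involved, but the pattern is the same.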
Finally, the real problem. Yang-Mills theory should confine. This means, very very roughly, that there are some observables which make sense in the classical theory but which are not well-defined in the quantum theory; quantum mechanical effects prevent the phenomena that these observables describe. In the measure theoretic formulation, you see this by watching the expectation values of these suspect observables diverge (or otherwise fail to remain integrable) as you approach the infinite-volume limit.
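(A standard heuristic diagnostic, not spelled out above: in a confining theory the Wilson loop expectations are expected to obey an area law, $\langle W_\gamma\rangle \sim e^{-\sigma\,\mathrm{Area}(\gamma)}$ with $\sigma > 0$, rather than a perimeter law; and the observables that misbehave in the infinite-volume limit are, roughly, those describing isolated charges.)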
About the Operators:
In classical Yang-Mills theory, the coordinate observables $A \mapsto A_{i,a}(x)$ satisfy equations of motion, the Yang-Mills equations. Moreover, in classical field theory, for pure states, the expectation value of a product of observables $\mathcal{O}_1\mathcal{O}_2...$ is the product of the individual expectation values. In quantum YM, the situation is more complicated: coordinate observables may not be well-defined, thanks to confinement, and in any case, observables only satisfy equations of motion in a fairly weak sense: if $\mathcal{O}_1$ is an expression which would vanish in the classical theory thanks to the equations of motion, then the expectation value of $\mathcal{O}_1\mathcal{O}_2...$ is a distribution supported on the supports of $\mathcal{O}_2...$.
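Schematically, this weak form of the equations of motion is the Schwinger-Dyson identity
$$\int_{\mathcal{F}} \frac{\delta S}{\delta A_{i,a}(x)}\,\mathcal{O}_2\cdots\; d\mu_\hbar \;=\; \hbar \int_{\mathcal{F}} \frac{\delta(\mathcal{O}_2\cdots)}{\delta A_{i,a}(x)}\; d\mu_\hbar,$$
whose right-hand side produces exactly the contact terms supported on the supports of $\mathcal{O}_2,\dots$, and which formally reduces to the classical equations of motion as $\hbar \to 0$.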
Actually, Newton did NOT say that $F = m a$ (i.e., ${d^2 x \over dt^2} = {1\over m} F$) in the Principia. First of all, if he did, no one at that time would have understood what it meant, since that was pre-calculus times. What he did say in his 2nd Law was that "Change of motion is proportional to impressed motive force and is in the same direction as the impressed force", i.e., in modern terminology, the instantaneous change of momentum (caused by something like a hammer blow) is equal to the applied impulse. Whenever he used the 2nd Law, he treated a smooth force as the limit of a large number of small impulses. It was only much later that Euler recast the 2nd Law as a 2nd order ODE. This is all discussed in considerable detail in a recent book, "Differential Equations, Mechanics, and Computation" (that I co-authored with my son Robert). There is a "Web Companion" for the book at the URL http://ode-math.com where you can freely download more than half the book as pdf files. In particular, if you download the first pages of "Chapter 4: Newton's Equations" you will see all of this discussed in considerable detail. One more point: this book was explicitly written to be, as we stress, a conceptual introduction to the subject for someone like yourself who is learning this material for the first time. (See here: http://ode-math.com/NovelFeaturesOfODECM.html)
All of the above is somewhat tangential to your specific question, so let me add that a very major restriction imposed by saying that the laws of motion for, say, $n$ particles are (equivalent to) a system of 2nd order ODEs is that if you know the positions and velocities of the particles at any one instant, then their positions at any time in the past and future are uniquely determined by that data. Or, as Laplace said in a very famous quote, "The current state of Nature is evidently a consequence of what it was in the preceding moment, and if we conceive of an intelligence that at a given moment knows the relations of all things of this Universe, it could then tell the positions, motions and effects of all of these entities at any past or future time. . ." (Of course we now know that the existence of "chaotic behavior" renders that only a very theoretical possibility.)
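As a toy illustration of both points (a smooth force treated as the limit of many small impulses, and the fact that one instant's positions and velocities determine the whole trajectory), here is a small sketch; the harmonic force, mass, and step size are my own illustrative choices, not taken from the book:

```python
import math

# A toy sketch (illustrative choices, not from the book cited above): treat a
# smooth force F(x) = -k*x as a long sequence of small impulses. Each impulse
# F*dt changes the momentum instantaneously; between impulses the particle
# drifts freely. As dt -> 0 this recovers the solution of m*x'' = F(x), and
# the whole trajectory is fixed by the initial position and velocity alone.
m, k = 1.0, 1.0          # mass and (hypothetical) spring constant
x, v = 1.0, 0.0          # initial position and velocity: the only data needed
dt = 1e-4                # size of each small impulse step

for _ in range(int(round(2 * math.pi / dt))):   # integrate over one period
    v += (-k * x / m) * dt   # impulse: instantaneous change of momentum (per unit mass)
    x += v * dt              # free drift between impulses

print(x, v)  # approximately (1.0, 0.0), the exact values after one full period
```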
Best Answer
Yes indeed, Maxwell's equations are Euler-Lagrange equations. And this is quite interesting. Let me give here a presentation within Special Relativity, in which the light speed is set to $c=1$. The ambient space is therefore a Minkowski space $\mathbb R^{1+3}$ with metric $dt^2-d{x_1}^2-d{x_2}^2-d{x_3}^2$. I restrict myself to the case of a vacuum.
The electromagnetic field is by definition a closed two-form $\Omega$. The components of the field can be retrieved, in a given coordinate frame, by $$\Omega=\vec E\cdot dt\times dx+\vec B\cdot(dx\times dx).$$ The constraint $d\Omega=0$ reads $$\partial_t\vec B+{\rm curl}_x\vec E=0,\qquad{\rm div}_x\vec B=0.$$
The rest of Maxwell's equations are obtained by writing $$\delta\int\int L(\Omega)dxdt=0,$$ still under the constraint that the variations of $\Omega$ are compatible with the closedness. Writing $L$ as a function of $(\vec B,\vec E)$, we obtain $$\partial_t\vec D-{\rm curl}_x\vec H=0,\qquad{\rm div}_x\vec D=0,\qquad(\dagger)$$ where $$\vec D=\frac{\partial L}{\partial\vec E},\qquad\vec H=\frac{\partial L}{\partial\vec B}.$$ An important point is that $L$ must be invariant under Lorentz transformations. This translates the following way: there exists a function $\ell$ of two scalar variables only, such that $$L(\Omega)=\ell\left(\frac12(\vec E^2-\vec B^2),\vec E\cdot\vec B\right).$$ For instance, the choice $L=\frac12(\vec E^2-\vec B^2)$ yields the standard, linear Maxwell's equations (in which $D=E$ and $H=-B$).
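To spell out the derivatives: for a general Lorentz-invariant $L = \ell\big(\tfrac12(\vec E^2-\vec B^2),\, \vec E\cdot\vec B\big)$, the chain rule gives
$$\vec D = \ell_1\,\vec E + \ell_2\,\vec B, \qquad \vec H = -\ell_1\,\vec B + \ell_2\,\vec E,$$
where $\ell_1,\ell_2$ denote the partial derivatives of $\ell$ with respect to its two arguments; the choice $\ell(u,v)=u$ recovers $\vec D=\vec E$ and $\vec H=-\vec B$, as stated above.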
The energy density is a partial Legendre transform, $$W=\vec D\cdot\vec E-L.$$ The Poynting vector is $\vec E\times\vec H$. It also equals $\vec D\times\vec B$. The fact that both formulas give the same quantity is equivalent to the Lorentz invariance.
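In the linear case $L = \frac12(\vec E^2 - \vec B^2)$, for instance, $W = \vec E\cdot\vec E - \frac12(\vec E^2 - \vec B^2) = \frac12(\vec E^2 + \vec B^2)$, the familiar electromagnetic energy density.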
Edit. Here are some of the details. The variational principle $\delta{\cal L}=0$, where ${\cal L}(\Omega)=\int\int L(\Omega)dxdt$, means that $$\left.\frac{d}{d\epsilon}\right|_{\epsilon=0}\int\int L(\Omega+\epsilon\alpha)dxdt=0$$ for every closed $2$-form $\alpha$. Equivalently, we have $$\left.\frac{d}{d\epsilon}\right|_{\epsilon=0}\int\int L(\Omega+\epsilon d\beta)dxdt=0$$ for every $1$-form (say smooth, compactly supported) $\beta$. Let us write $\beta=\phi dt-\vec A\cdot dx$; then $d\beta=(\partial_t\vec A+\nabla\phi)\cdot dt\times dx+{\rm curl}\vec A \cdot dx\times dx$. We therefore have $$\int\int(\vec H\cdot{\rm curl}\vec A+\vec D\cdot(\partial_t\vec A+\nabla\phi))dxdt=0$$ for every test function $\phi$ and field $\vec A$. This gives $\partial_t\vec D-{\rm curl}\vec H=0$ and ${\rm div}\vec D=0$.
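Spelling out the last step: integrating by parts (there are no boundary terms, since $\phi$ and $\vec A$ are compactly supported) turns the identity above into
$$\int\int\Big(\big({\rm curl}\,\vec H - \partial_t\vec D\big)\cdot\vec A \;-\; \big({\rm div}\,\vec D\big)\,\phi\Big)\,dx\,dt = 0,$$
and since $\phi$ and $\vec A$ are arbitrary, both brackets must vanish separately.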
Edit. In a recent paper, I explore a variant of the variational principle, in which the admissible variations run over the same class as $\Omega$, modulo pullback by a diffeomorphism: $\delta\Omega=\Omega-\phi^*\Omega$. This is narrower than the additive perturbation considered above. The resulting equations are interesting. We don't obtain the full ($\dagger$), but we do obtain the conservation of energy $$\partial_tW+{\rm div}(\vec E\times\vec H)=0,$$ and that of momentum, which are perhaps more natural.