It turns out that there are two different but related notions of differentiation for a function $f:\mathbb R^n\to\mathbb R$: the total derivative $df$ and the gradient $\nabla f$.
- The total derivative is a covector ("dual vector", "linear form") and does not depend on the choice of a metric ("measure of length").
- The gradient is an ordinary vector derived from the total derivative, but it depends on a metric. That is why it looks a bit odd in some coordinate systems.
The definition of the total derivative answers the following question: given a vector $\vec v$, what is the slope of the function $f$ in the direction of $\vec v$? The answer is, of course,
$$ df_{x}(\vec v) = \lim_{t\to0} \frac{f(x+t\vec v)-f(x)}{t}$$
That is, you start at the point $x$, walk a tiny bit in the direction of $\vec v$, and take note of the ratio $\Delta f/\Delta t$.
Note that the total derivative is a linear map $\mathbb R^n \to \mathbb R$, not a vector in $\mathbb R^n$. Given a vector, it tells you some number. In coordinates, this is usually written as
$$ df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy + \frac{\partial f}{\partial z}dz $$
where $dx,dy,dz$ are the total derivatives of the coordinate functions, for instance $dx(v_x,v_y,v_z) := v_x$. This formula looks the same in any coordinate system.
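As a quick numerical sanity check (the function $f$ below is just an illustrative example), the limit definition and the coordinate formula agree:

```python
import numpy as np

def f(p):
    # example function f(x, y, z) = x^2 * y + z
    x, y, z = p
    return x**2 * y + z

def directional_derivative(f, x, v, t=1e-6):
    # central-difference approximation of df_x(v)
    return (f(x + t * v) - f(x - t * v)) / (2 * t)

x = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 0.0, 1.0])

# coordinate formula: df = (2xy) dx + (x^2) dy + 1 dz
partials = np.array([2 * x[0] * x[1], x[0]**2, 1.0])

print(directional_derivative(f, x, v))  # limit definition
print(partials @ v)                     # df applied to v via the coordinate formula
```

Both lines print (approximately) the same number, as they must.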
In contrast, the gradient answers the following question: what is the direction of steepest ascent of the function? That is, which vector $\vec v$ of unit length maximizes $df(\vec v)$? As you can see, this definition crucially depends on the fact that you can measure the length of a vector. The gradient is then defined as
$$ \nabla f = df(\vec v_{max})\cdot\vec v_{max} $$
i.e. it gives both the direction and the magnitude of the steepest change.
This can also be expressed as
$$ \langle \nabla f, \vec v \rangle = df(\vec v) \quad\forall \vec v\in\mathbb R^n.$$
In other words, the scalar product $\langle\cdot,\cdot\rangle$ is used to convert the covector $df$ into the vector $\nabla f$. This also means that the formula for the gradient looks very different in coordinate systems other than Cartesian ones. If the scalar product is changed (say, to $\langle\vec a,\vec b\rangle := a_xb_x + a_yb_y + 4a_zb_z$), then the direction of steepest ascent also changes. (Exercise: Why?)
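To see the metric dependence concretely: writing the modified scalar product as $\langle\vec a,\vec b\rangle = \vec a^T M \vec b$ with $M = \operatorname{diag}(1,1,4)$, the defining property $\langle\nabla f,\vec v\rangle = df(\vec v)$ for all $\vec v$ gives $\nabla f = M^{-1}(df)$, where $(df)$ denotes the column of partial derivatives. A small sketch (the sample covector is made up):

```python
import numpy as np

# df at some point, as a covector (the row of partial derivatives)
df = np.array([1.0, 0.0, 2.0])

# Euclidean metric: the gradient has the same components as df
grad_euclid = df

# modified scalar product <a, b> = a_x b_x + a_y b_y + 4 a_z b_z
M = np.diag([1.0, 1.0, 4.0])

# defining property <grad, v>_M = df(v) for all v gives grad = M^{-1} df
grad_M = np.linalg.solve(M, df)

print(grad_euclid)  # [1. 0. 2.]
print(grad_M)       # [1. 0. 0.5] -- the direction of steepest ascent changed
```

Note that $\langle \nabla f, \vec v\rangle_M = (M^{-1}df)^T M \vec v = df \cdot \vec v$ for every $\vec v$, so the defining equation is indeed satisfied.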
Although many examples of Lagrangian duality in textbooks involve functions and constraints for which it is easy to minimize the Lagrangian for a fixed Lagrange multiplier, there are situations where there is no explicit formula for the dual function. For example, consider the problem:
$\min \| Ax - b \|_{1}$
subject to
$Cx=d$.
The Lagrangian is
$L(x,\lambda)=\| Ax-b \|_{1} + \lambda^{T}(Cx-d)$
The dual function is
$g(\lambda)=\inf_{x}\left(\| Ax-b \|_{1} + \lambda^{T}(Cx-d)\right)$.
There's no explicit formula for $g(\lambda)$, although it's possible to evaluate $g(\lambda)$ by solving an LP.
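Here is a sketch of that LP evaluation with `scipy.optimize.linprog` (the problem data below are made up for illustration): introducing $s$ with $-s \le Ax-b \le s$ turns $\|Ax-b\|_1$ into the linear objective $\mathbf{1}^T s$, so $g(\lambda)$ is the optimal value of an LP in $(x,s)$.

```python
import numpy as np
from scipy.optimize import linprog

# hypothetical problem data, for illustration only
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
C = np.array([[1.0, -1.0]])
d = np.array([0.0])

def dual_function(lam):
    """Evaluate g(lam) = inf_x ||Ax - b||_1 + lam^T (Cx - d) by solving an LP.

    With -s <= Ax - b <= s, the objective becomes
    (C^T lam)^T x + 1^T s - lam^T d, which is linear in (x, s).
    """
    m, n = A.shape
    c = np.concatenate([C.T @ lam, np.ones(m)])
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n + [(0, None)] * m)
    if not res.success:
        return -np.inf  # g(lam) = -infinity when the LP is unbounded below
    return res.fun - lam @ d

print(dual_function(np.array([0.1])))
```

For $\lambda$ where the infimum is $-\infty$ (i.e. outside the domain of $g$), the LP is unbounded and the sketch returns $-\infty$.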
Best Answer
Think about the differentiability of a function $f:\left(a,b\right)\subset\mathbb{R}\to \mathbb{R}$. If $x\in \left(a,b\right)$, then $f'\left(x\right)$ is ordinarily defined to be the real number
\begin{equation*} f'\left(x\right) = \lim_{h\to 0}{\frac{f\left(x+h\right)-f\left(x\right)}{h}}, \end{equation*}
provided that the limit exists. Therefore, we can write
\begin{equation*} f\left(x+h\right) - f\left(x\right) = f'\left(x\right)h + r\left(h\right) \end{equation*}
where $r\left(h\right)$ is a remainder term that satisfies
\begin{equation*} \lim_{h\to 0}{\frac{r\left(h\right)}{h}} = 0. \end{equation*}
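You can watch this remainder vanish numerically (the function below is just an example): for $f(x)=x^3$ one has $r(h) = 3xh^2 + h^3$, so $r(h)/h = 3xh + h^2 \to 0$.

```python
# check that r(h) = f(x+h) - f(x) - f'(x) h satisfies r(h)/h -> 0
def f(x):
    return x**3

def fprime(x):
    return 3 * x**2

x = 2.0
for h in [1e-1, 1e-2, 1e-3]:
    r = f(x + h) - f(x) - fprime(x) * h
    print(h, r / h)  # here r/h = 3*x*h + h**2, shrinking linearly in h
```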
This motivates the definition of differentiability for a function $f$ from an open subset $S\subset\mathbb{R}^{n}$ to $\mathbb{R}$. Namely, $f:S\subset\mathbb{R}^{n}\to\mathbb{R}$ is differentiable at the point $\mathbf{a}\in S$ if there exists $\mathbf{c}\in \mathbb{R}^{n}$ such that
\begin{equation*} \lim_{\mathbf{h}\to 0}{\frac{f\left(\mathbf{a}+\mathbf{h}\right)-f\left(\mathbf{a}\right) - \mathbf{c}\cdot\mathbf{h}}{\lVert\mathbf{h}\rVert}} = 0. \end{equation*}
In this case, $\mathbf{c}$ is called the gradient of $f$ at $\mathbf{a}$ and is denoted $\nabla f\left(\mathbf{a}\right)$.
If we define $E\left(\mathbf{h}\right) = f\left(\mathbf{a}+\mathbf{h}\right) - f\left(\mathbf{a}\right) - \nabla f\left(\mathbf{a}\right)\cdot\mathbf{h}$, then we can write
\begin{equation*} f\left(\mathbf{a}+\mathbf{h}\right) = f\left(\mathbf{a}\right) + \nabla f\left(\mathbf{a}\right)\cdot\mathbf{h} + E\left(\mathbf{h}\right)\,\,\,\,\mbox{ where }\,\,\,\,\frac{E\left(\mathbf{h}\right)}{\lVert\mathbf{h}\rVert}\to 0\,\,\,\,\mbox{ as }\,\,\,\,\mathbf{h}\to 0. \end{equation*}
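The condition $E(\mathbf{h})/\lVert\mathbf{h}\rVert \to 0$ is easy to observe numerically (the function and the point below are illustrative choices):

```python
import numpy as np

def f(p):
    x, y = p
    return np.sin(x) * y

def grad_f(p):
    x, y = p
    return np.array([np.cos(x) * y, np.sin(x)])

a = np.array([0.5, 1.5])
h = np.array([1.0, -2.0])

# E(h) = f(a + h) - f(a) - grad_f(a) . h; the ratio E(h)/||h|| must vanish
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    ht = t * h
    E = f(a + ht) - f(a) - grad_f(a) @ ht
    print(t, E / np.linalg.norm(ht))  # ratio shrinks roughly linearly in t
```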
The motivation for this definition is that it allows us to view $f\left(\mathbf{a}\right) + \nabla f\left(\mathbf{a}\right)\cdot\mathbf{h}$ as a linear (or rather, affine) approximation to $f\left(\mathbf{a}+\mathbf{h}\right)$.
In your case, adjust the notation by taking $\mathbf{a} = \overline{x}$ and $\mathbf{h} = x-\overline{x}$, and write the dot product $\nabla f\left(\overline{x}\right)\cdot\mathbf{h}$ as the matrix product $\nabla f\left(\overline{x}\right)^{t}\left(x-\overline{x}\right)$; then this becomes
\begin{equation*} f\left(x\right) = f\left(\overline{x}\right) + \nabla f\left(\overline{x}\right)^{t}\left(x-\overline{x}\right) + E\left(x-\overline{x}\right) \end{equation*}
where
\begin{equation*} \frac{E\left(x-\overline{x}\right)}{\lVert x-\overline{x}\rVert}\to 0\,\,\,\,\mbox{ as }\,\,\,\,x-\overline{x}\to 0. \end{equation*}
Then you can see that
\begin{equation*} E\left(x-\overline{x}\right) = \lVert x-\overline{x}\rVert\,\beta\left(\overline{x};x\right) \implies \beta\left(\overline{x};x\right) = \frac{E\left(x-\overline{x}\right)}{\lVert x-\overline{x}\rVert}, \end{equation*}
so the requirement that $E\left(x-\overline{x}\right)/\lVert x-\overline{x}\rVert\to 0$ as $x\to\overline{x}$ (i.e., as $x-\overline{x}\to 0$) is equivalent to $\beta\left(\overline{x};x\right)\to 0$ as $x\to\overline{x}$.
Oh, and the semicolon is just used to emphasize that $\beta$ depends on both $x$ and $\overline{x}$; it could just as well be written $\beta\left(\overline{x},x\right)$, or in any other way that makes this clear.