Connection of functional derivative with variational derivative: $\frac{\delta}{\delta\phi(x)} F[\phi] = \frac{\delta F[\phi]}{\delta\phi}(x)$. Note that the variational derivative carries an extra dependence on the coordinate variable $x$. It helps to make this dependence explicit whenever there is a risk of confusion.
Functional derivative Leibniz rule: $\frac{\delta}{\delta\phi(x)} F[\phi] G[\phi] = \frac{\delta F[\phi]}{\delta\phi}(x) G[\phi] + F[\phi] \frac{\delta G[\phi]}{\delta\phi}(x)$.
Special case: $F_x[\phi] = \phi(x)$, $G_{i,y}[\phi] = (\partial_i\phi)(y)$, and $$\frac{\delta}{\delta\phi(z)} \bigl(F_x[\phi] G_{i,y}[\phi]\bigr) = \delta(x-z) (\partial_i\phi)(y) - \phi(x) \frac{d}{dz_i}\delta(y-z).$$
Notice the distributional coefficients in the derivatives. There is no way to get away from them if you wish to consider $\phi(x)$ and such as functionals in their own right.
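The special case above can be checked on a grid. The following is only a minimal sketch under assumptions that are not in the text: one periodic dimension, a central difference for $\partial\phi$, the discrete delta $\delta_{ij}/\Delta x$ standing in for $\delta(x-z)$, and $\frac{1}{\Delta x}\frac{\partial}{\partial\phi_j}$ standing in for $\frac{\delta}{\delta\phi(z)}$. All names (`discrete_delta`, `leibniz_formula`, the indices `a`, `b`) are hypothetical.

```python
import math

N = 50                 # number of grid points (assumed periodic grid on [0, 1))
dx = 1.0 / N
phi = [math.sin(2 * math.pi * k * dx) + 0.3 * math.cos(4 * math.pi * k * dx)
       for k in range(N)]
a, b = 7, 20           # grid indices playing the roles of x and y


def F(p):
    """F_x[phi] = phi(x), with x = a * dx."""
    return p[a]


def G(p):
    """G_y[phi] = (d phi)(y), central difference at y = b * dx."""
    return (p[(b + 1) % N] - p[(b - 1) % N]) / (2 * dx)


def product(p):
    return F(p) * G(p)


def discrete_delta(i, j):
    """Grid stand-in for delta(x_i - x_j): Kronecker delta divided by dx."""
    return 1.0 / dx if i % N == j % N else 0.0


def numerical_functional_derivative(func, p, j, eps=1e-6):
    """(1/dx) * d func / d phi_j, by a central difference in phi_j."""
    up = list(p)
    up[j] += eps
    down = list(p)
    down[j] -= eps
    return (func(up) - func(down)) / (2 * eps * dx)


def leibniz_formula(p, j):
    """delta(x - z) (d phi)(y) - phi(x) d/dz delta(y - z), with z = j * dx."""
    ddelta_dz = (discrete_delta(b, j + 1) - discrete_delta(b, j - 1)) / (2 * dx)
    return discrete_delta(a, j) * G(p) - F(p) * ddelta_dz


max_err = max(
    abs(numerical_functional_derivative(product, phi, j) - leibniz_formula(phi, j))
    for j in range(N)
)
print("largest discrepancy over all z:", max_err)
```

The entries of size $1/\Delta x$ and $1/\Delta x^2$ that appear in `leibniz_formula` are the grid shadows of the distributional coefficients mentioned above: they do not stay bounded as the grid is refined.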
If you are interested in the BV formalism as it is used in physics, where the distinction between the functional and variational derivatives is barely remarked upon, I recommend the reviews by Henneaux and by Gomis, París and Samuel: doi:10.1016/0920-5632(90)90647-D, doi:10.1016/0370-1573(94)00112-G. If you are interested in the BV formalism purely from the point of view of jets, without bringing functionals into the picture other than peripherally, I recommend the early paper of McCloud and this sequence of papers by Barnich, Brandt and Henneaux: arXiv:hep-th/9307022, arXiv:hep-th/9405109, arXiv:hep-th/9405194, arXiv:hep-th/0002245. If you are more interested in the BV formalism from the functional point of view, with the appropriate level of functional analysis included and with jets appearing only peripherally, I recommend the papers by Fredenhagen and Rejzner, as well as Rejzner's thesis: arXiv:1101.5112, arXiv:1110.5232, arXiv:1111.5130.
In many variational problems one is given an action functional $f\mapsto S[f]$, described by an integral
$$ S[f]=\int_\Omega L\bigl(\;x,f(x),D f(x),\dotsc, D^k f(x)\;\bigr) dx $$
in which
- $\Omega$ is a region in some Euclidean space $\mathbb{R}^n$, $x\in\Omega$,
- $f$ is a $k$-times differentiable function $f: \Omega\to\mathbb{R}^m$, and
- the Lagrangian $L$ is a function of the appropriate number of variables.
For example, in classical mechanics $n=1$, $\Omega\subset \mathbb{R}$ is an interval and the function
$$f:\Omega\to \mathbb{R}^m, \;\;\Omega\ni t\mapsto f(t)\in\mathbb{R}^m$$
describes a path in $\mathbb{R}^m$. The Lagrangian has the form $L: \mathbb{R}^m\times\mathbb{R}^m\to \mathbb{R}$, $\newcommand{\bR}{\mathbb{R}}$
$$L(y,v)=\frac{1}{2}|v|^2-U(y), \;\; (y,v)\in\bR^m\times\bR^m $$
where the potential $U$ is a function $U:\bR^m\to\bR$. Then
$$ L(f,\dot{f})= \frac{1}{2}|\dot{f}|^2-U(f), $$
where the dot indicates the derivative with respect to the time parameter $t$ on $\Omega\subset\mathbb{R}$.
The functional (or variational) derivative of $S$ calculated at $f_0:\Omega\to\mathbb{R}^m$ is a gadget $\delta S[f_0]$ that feeds on an infinitesimal deformation $\delta f$ of $f_0$ and returns a scalar
$$ \langle \delta S[f_0], \delta f\rangle =\lim_{h\to 0}\frac{1}{h} \bigl(S[f_0+h\delta f]-S[f_0]\;\bigr). \tag{1}$$
The deformation $\delta f$ is also a function $\Omega\to\mathbb{R}^m$. It is often desirable to identify $\delta S[f_0]$ with a function $g:\Omega\to\mathbb{R}^m$ which, if it exists, is uniquely determined by the equality
$$ \langle \delta S[f_0], \delta f\rangle=\langle g(x), \delta f(x)\rangle =\int_\Omega \bigl( g(x), \delta f(x) \bigr) dx,\tag{2} $$
where $(-,-)$ denotes the natural inner product on $\mathbb{R}^m$. The value of $g$ at $x_0$ can be obtained from the equality
$$ g(x_0)= \langle g(x), \delta(x-x_0)\rangle. \tag{3} $$
This means that the value of $g$ at $x_0$ is obtained by formally replacing $\delta f(x)$ with $\delta(x-x_0)$ in (2).
Making the same formal replacement $\delta f(x)\to\delta(x-x_0)$ in (1) one obtains the physicists' functional derivative in your question.
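The formal replacement $\delta f(x)\to\delta(x-x_0)$ can be imitated numerically by using a narrow bump of unit integral in place of the delta function. The sketch below is an illustration under assumptions that are not in the text: the functional is the hypothetical $S[f]=\int_0^1 f(x)^3\,dx$, for which $g(x)=3f(x)^2$, the base function is $f(x)=1+x^2$, and the bump is a hat function of half-width `w`.

```python
def f(x):
    """Base function f_0 (an arbitrary choice for this sketch)."""
    return 1.0 + x * x


def S(func, n=20000):
    """Midpoint rule for the assumed functional S[f] = int_0^1 f(x)^3 dx."""
    dx = 1.0 / n
    return sum(func((k + 0.5) * dx) ** 3 for k in range(n)) * dx


def bump(x, x0, w):
    """Hat function centred at x0 with half-width w and unit integral."""
    return max(0.0, 1.0 - abs(x - x0) / w) / w


x0, w, h = 0.5, 0.02, 1e-6


def perturbed(x):
    """f_0 + h * (approximate delta function at x0)."""
    return f(x) + h * bump(x, x0, w)


# Difference quotient from (1), with delta f replaced by the bump.
approx_g = (S(perturbed) - S(f)) / h
# Value of the representing function g(x) = 3 f(x)^2 at x0.
exact_g = 3.0 * f(x0) ** 2
print(approx_g, exact_g)
```

The two numbers agree only approximately: the bump has finite width, and $h$ must be small compared with `w` for the difference quotient to stay in the linear regime.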
How does one identify $\delta S[f_0]$ with a function? In the example from classical mechanics one has
$$ \delta S[f_0](t)=-\frac{d}{dt}\frac{\partial L}{\partial v}(f_0(t), \dot{f}_0(t))+\frac{\partial L}{\partial y}(f_0(t),\dot{f}_0(t)). $$
How can one see this? Fix a function (path) $f_0=f_0(t)$. For simplicity set $\alpha:=\delta f$ and assume $\Omega=[0,1]$. We have a Taylor approximation
$$ L(f_0+h \alpha, \dot{f}_0+h\dot{\alpha}) = L(f_0,\dot{f}_0)+ h\frac{\partial L}{\partial y}(f_0,\dot{f_0})\alpha+h\frac{\partial L}{\partial v}(f_0,\dot{f_0})\dot{\alpha} +O(h^2). $$
Hence
$$\frac{1}{h} \bigl(\; S[f_0+h\alpha]-S[f_0]\;\bigr)=\int_0^1 \frac{\partial L}{\partial y}(f_0,\dot{f_0})\alpha+\frac{\partial L}{\partial v}(f_0,\dot{f_0})\dot \alpha dt +O(h). $$
Letting $h\to 0$ we deduce
$$\langle \delta S[f_0],\alpha\rangle = \int_0^1\frac{\partial L}{\partial y}(f_0,\dot{f_0})\alpha+\frac{\partial L}{\partial v}(f_0,\dot{f_0})\dot{\alpha} dt. $$
If we further assume that $\alpha(0)=\alpha(1)=0$, so that the boundary term vanishes, then upon integrating by parts we deduce
$$ \langle \delta S[f_0],\alpha\rangle=\int_0^1\Bigl(\frac{\partial L}{\partial y}(f_0,\dot{f_0}) -\frac{d}{dt}\frac{\partial L}{\partial v}(f_0,\dot{f_0})\Bigr) \alpha dt. $$
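The last identity can be tested numerically. The following sketch assumes choices that are not in the text: $m=1$, the potential $U(y)=\frac12 y^2$, the path $f_0(t)=t^2$ and the deformation $\alpha(t)=\sin(\pi t)$, which vanishes at both endpoints. It compares the difference quotient from (1) with the integral of $\bigl(\frac{\partial L}{\partial y}-\frac{d}{dt}\frac{\partial L}{\partial v}\bigr)\alpha$.

```python
import math


def f0(t):
    return t * t


def df0(t):
    return 2.0 * t


def ddf0(t):
    return 2.0


def alpha(t):
    """Deformation with alpha(0) = alpha(1) = 0."""
    return math.sin(math.pi * t)


def dalpha(t):
    return math.pi * math.cos(math.pi * t)


def lagrangian(y, v):
    """L(y, v) = |v|^2 / 2 - U(y), with the assumed potential U(y) = y^2 / 2."""
    return 0.5 * v * v - 0.5 * y * y


def integrate(g, n=2000):
    """Midpoint rule on [0, 1]."""
    dt = 1.0 / n
    return sum(g((k + 0.5) * dt) for k in range(n)) * dt


def action(h):
    """S[f0 + h * alpha]."""
    return integrate(
        lambda t: lagrangian(f0(t) + h * alpha(t), df0(t) + h * dalpha(t))
    )


h = 1e-6
# Left-hand side: difference quotient approximating <delta S[f0], alpha>.
gateaux = (action(h) - action(0.0)) / h
# Right-hand side: for this L, dL/dy - d/dt dL/dv = -f0 - f0''.
euler_lagrange = integrate(lambda t: (-f0(t) - ddf0(t)) * alpha(t))
print(gateaux, euler_lagrange)
```

Replacing `alpha` by a function that does not vanish at the endpoints makes the two numbers differ by the boundary term $\bigl[\frac{\partial L}{\partial v}\alpha\bigr]_0^1$.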
A good place to look for more details is the book "Calculus of Variations" by Gelfand and Fomin, Dover 2000.
Best Answer
Premise (a long one): before answering your questions, I must say that, if you are looking for mathematically rigorous information, you should not rely on the Wikipedia entry "Functional derivative" in its current state, since it is seriously flawed due to an "edit war" between me and another contributor (or perhaps it would be better to say between him and all other contributors, as you can see by having a look at the talk page of the entry). Because of this, the entry is written more from a theoretical physicist's point of view than from a contemporary mathematical perspective, and its content adheres strictly and tacitly to the hypotheses assumed (even implicitly) by Vito Volterra ([6], §II.1.26-II.1.28, pp. 22-24). Namely:
Volterra implicitly assumes that the functional $F$ is of integral type, i.e. similar to the functionals encountered in the classical calculus of variations. However, in general functional analysis, this is not always true: for example, the following functional, defined on ${C}^1(\Omega)$, $\Omega\subseteq\Bbb R^n$, $$ F[\rho]=\sum_{i=1}^n\frac{\partial\rho}{\partial x_i}(0)=\langle\vec{\mathbf{1}},\nabla \rho(0)\rangle \neq\int\limits_{\Omega}\!\rho(x)\,\mathrm{d}\mu_x,\label{nif}\tag{NIF} $$ cannot be expressed as an integral with respect to any given measure, as is well known from the theory of distributions.
In general, the functional derivative cannot always be represented as the left side term of \eqref{1}, since that term may not be defined and, even when it is, it can differ from the central and right side terms (which however represent the true definition of the functional derivative), unless it is interpreted, by abuse of notation, as a distribution or as another kind of generalized function. However, there is a deeper issue, described in the following point.
Volterra explicitly assumes that the variation of $F$, i.e. the quantity $$ \Delta F[\rho]=F[\rho+\delta\rho]-F[\rho]=F[\rho+\varepsilon\phi]-F[\rho] $$ is linear with respect to the increment $\delta\rho=\varepsilon \phi$ apart from a remainder behaving as $o(\varepsilon)$ as $\varepsilon\to 0$. Now, while the requirement on the asymptotic behavior is basically equivalent to the existence of the limit \eqref{1}, the linearity hypothesis is not always satisfied ([3], §3.1-3.3, pp. 35-40, and [4] §2.1 p. 15, §3.1-3.3 pp. 30-33). For example, the following functional defined on ${C}^1(\Bbb R^n)$ by using a function $\rho_o\in C^1(\Bbb R^n)$ such that $\rho_o\not\equiv 0$, $$ F[\rho] = \int\limits_{G} \frac{|\nabla(\rho(x)-\rho_o(x))|^2}{\rho(x)-\rho_o(x)}\exp\left(-\frac{|\nabla(\rho(x)-\rho_o(x))|^4}{|\rho(x)-\rho_o(x)|^2}\right) \mathrm{d}x, \quad G\Subset\Bbb R^n \label{nlf}\tag{NLF} $$ (the specification of the precise form of the integrand on the zero set of $\rho-\rho_o$, as well as on the intersection between this set and the zero set of its gradient, would require a little more care, but this is only a technical detail and adds nothing to the answer) has a functional derivative which is not linear at the point $\rho_o$. Indeed, given $\phi\in C^1(\Bbb R^n)$ such that $\phi\neq 0$ in $G$, $$ F[\rho_o+\varepsilon \phi] = \varepsilon\int\limits_{G} \frac{|\nabla \phi(x)|^2}{\phi(x)}\exp\left(-\varepsilon^2\frac{|\nabla\phi(x)|^4}{|\phi(x)|^2}\right) \mathrm{d}x, $$ thus $$ \bigg{[}\frac{\mathrm{d}}{\mathrm{d}\varepsilon}F[\rho_o+\varepsilon \phi]\bigg{]}_{\varepsilon = 0} = \int\limits_{G} \frac{|\nabla \phi(x)|^2}{\phi(x)} \mathrm{d}x, $$ which is homogeneous of degree one but not additive in $\phi$.
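The failure of linearity for \eqref{nlf} can be seen numerically. The sketch below works under assumptions that are not in the answer: dimension $n=1$, $G=(0,1)$, the convention $F[\rho_o]=0$, and the two positive test directions $\phi_1(x)=1+x$ and $\phi_2(x)=1+x^2$. Since $F$ depends only on $u=\rho-\rho_o$, the code takes $u$ and its derivative as inputs; all function names are hypothetical.

```python
import math


def integrate(g, n=2000):
    """Midpoint rule on G = (0, 1)."""
    dx = 1.0 / n
    return sum(g((k + 0.5) * dx) for k in range(n)) * dx


def F_shifted(u, du):
    """Value of (NLF) in one dimension, written in terms of u = rho - rho_o."""
    def integrand(x):
        return du(x) ** 2 / u(x) * math.exp(-du(x) ** 4 / u(x) ** 2)
    return integrate(integrand)


def gateaux(phi, dphi, eps=1e-4):
    """Difference quotient F[rho_o + eps * phi] / eps, assuming F[rho_o] = 0."""
    return F_shifted(lambda x: eps * phi(x), lambda x: eps * dphi(x)) / eps


def phi1(x):
    return 1.0 + x


def dphi1(x):
    return 1.0


def phi2(x):
    return 1.0 + x * x


def dphi2(x):
    return 2.0 * x


d1 = gateaux(phi1, dphi1)
d2 = gateaux(phi2, dphi2)
d_sum = gateaux(lambda x: phi1(x) + phi2(x), lambda x: dphi1(x) + dphi2(x))
d_double = gateaux(lambda x: 2.0 * phi1(x), lambda x: 2.0 * dphi1(x))

print("derivative along phi1 + phi2:", d_sum)
print("sum of the two derivatives:  ", d1 + d2)
print("derivative along 2 * phi1 versus twice the derivative:", d_double, 2.0 * d1)
```

In this example the derivative scales correctly under multiplication of the direction by a constant, but the derivative along $\phi_1+\phi_2$ is visibly different from the sum of the derivatives along $\phi_1$ and $\phi_2$.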
Furthermore, while Volterra developed his functional calculus having in mind Banach spaces of continuous functions with respect to the uniform norm (even if the concept of a Banach space was not yet defined at the time), theoretical physicists apply it to far more general contexts, in general without any formal justification.
That said, I can proceed to answer your questions.
As is, that statement in the entry is not correct without assuming something about where the functional $F$ is defined, and thus about its structure. You have correctly noticed one of the basic issues: the functional derivative of $F$ is assumed to be a Gâteaux derivative, but this does not imply its positivity, and moreover it need not be representable as a measure, as example \eqref{nif} above shows. For example, it can be thought of as a distribution, as shown in this answer. Volterra derives the integral representation for the functional derivative on the left side of \eqref{1} under precise hypotheses ([6] §II.1.27 pp. 23-24 and reference [5] §2, pp. 99-102 cited therein), having in mind applications to the classical calculus of variations: under different hypotheses, this may not be true.
The limit depends on the structure of $\phi$, not only on its "size" (i.e. its norm when $M$ is a Banach space): this is probably the core difference between Gâteaux and Fréchet derivatives of functionals, with the former playing the role of the infinite-dimensional analogue of the directional derivative ([1] §1.1 p. 12 and [2] §1.B p. 11). When $M$ is a Banach space, the statement is clear, since $\phi$ enters the definition of the Fréchet derivative (equivalent to \eqref{2}) only through its norm, so that any $\phi$ with the same norm does the job. For more general topological vector spaces things are more complex, but you can have a look at reference [4], §3.2-3.2 pp. 30-32 for Gâteaux derivatives and at [2] §1.B p. 11 for Fréchet derivatives (see however [1] remark 1.2 pp. 11-12, on the definition of Fréchet derivatives in locally convex spaces and the issues involved in defining higher order derivatives).
Bibliographical note
Vainberg ([3], [4]) explicitly says that the functional derivative can be a nonlinear functional of the increment: however, he calls it the Gâteaux differential, reserving the name "derivative" for the cases where it is a linear functional, and this nomenclature seems to be nonstandard. All other authors deal extensively only with functionals having linear functional derivatives, sometimes not even mentioning the possibility of the existence of functionals like \eqref{nlf}.
Bibliography
[1] Ambrosetti, Antonio; Prodi, Giovanni, A primer of nonlinear analysis, Cambridge Studies in Advanced Mathematics, 34. Cambridge: Cambridge University Press, pp. viii+171 (1993), ISBN: 0-521-37390-5, MR1225101, ZBL0781.47046.
[2] Schwartz, Jacob T., Nonlinear functional analysis, Notes by H. Fattorini, R. Nirenberg and H. Porta. With an additional chapter by Hermann Karcher. (Notes in Mathematics and its Applications.) New York-London-Paris: Gordon and Breach Science Publishers, pp. VII+236 (1969), MR0433481, ZBL0203.14501.
[3] Vaĭnberg, Mordukhaĭ Moiseevich, Variational methods for the study of nonlinear operators. With a chapter on Newton’s method by L.V. Kantorovich and G.P. Akilov, translated and supplemented by Amiel Feinstein, Holden-Day Series in Mathematical Physics. San Francisco-London-Amsterdam: Holden-Day, Inc. pp. x+323 (1964), MR0176364, ZBL0122.35501.
[4] Vaĭnberg, Mikhail Mordukhovich, Variational method and method of monotone operators in the theory of nonlinear equations. Translated from Russian by A. Libin. Translation edited by D. Louvish, A Halsted Press Book. New York-Toronto: John Wiley & Sons; Jerusalem-London: Israel Program for Scientific Translations, pp. xi+356 (1973), MR0467428, ZBL0279.47022.
[5] Volterra, Vito, "Sulle funzioni che dipendono da altre funzioni [On functions which depend on other functions]" (in Italian), Atti della Reale Accademia dei Lincei, Rendiconti (4) III, No. 2, 97-105, 141-146, 153-158 (1887), JFM19.0408.01.
[6] Volterra, Vito, Theory of functionals and of integral and integro-differential equations. Dover edition with a preface by Griffith C. Evans, a biography of Vito Volterra and a bibliography of his published works by Sir Edmund Whittaker. Unabridged republ. of the first English transl, New York: Dover Publications, Inc. pp. 39+XVI+226 (1959), MR0100765, ZBL0086.10402.