I haven't quite got this straight yet, but I think one way to go is to think about choosing points at random from the positive reals. This answer is going to be rather longer than it really needs to be, because I'm thinking about this in a few (closely related) ways, which probably aren't all necessary, and you can decide to reject the uninteresting parts and keep anything of value. Very roughly, the idea is that if you "randomly" choose points from the positive reals and arrange them in increasing order, then the probability that the $(n+1)^\text{th}$ point is in a small interval $(t,t+dt)$ is a product of probabilities of independent events, $n$ factors of $t$ for choosing $n$ points in the interval $[0,t]$, one factor of $e^{-t}$ as all the other points are in $[t,\infty)$, one factor of $dt$ for choosing the point in $(t,t+dt)$, and a denominator of $n!$ coming from the reordering. At least, as an exercise in making a simple problem much harder, here it goes...
I'll start with a bit of theory before trying to describe intuitively why the probability density $\dfrac{t^n}{n!}e^{-t}$ pops out.
We can look at the homogeneous Poisson process (with rate parameter $1$). One way to think of this is to take a sequence of independent exponentially distributed random variables with rate parameter $1$, $S_1,S_2,\ldots$, and set $T_n=S_1+\cdots+S_n$. As has been commented on already, $T_{n+1}$ has the probability density function $\dfrac{t^n}{n!}e^{-t}$. I'm going to avoid proving this immediately though, as it would just reduce to manipulating some integrals. Then, the Poisson process $X(t)$ counts the number of times $T_i$ lying in the interval $[0,t]$.
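As a quick sanity check of the density claim (a simulation sketch, not a proof), one can sum exponentials and compare a histogram estimate with $\dfrac{t^n}{n!}e^{-t}$. The helper names, the seed, the sample size and the bin width below are all arbitrary choices of mine, not anything from the argument above.

```python
import math
import random

def sample_T(count, rng):
    """Return S_1 + ... + S_count with the S_i independent Exp(1)."""
    return sum(rng.expovariate(1.0) for _ in range(count))

def gamma_density(t, n):
    """The claimed density t^n e^{-t} / n! of T_{n+1}."""
    return t**n * math.exp(-t) / math.factorial(n)

def empirical_density(samples, t, width):
    """Fraction of samples in (t, t + width], divided by width."""
    hits = sum(1 for s in samples if t < s <= t + width)
    return hits / (len(samples) * width)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 4
samples = [sample_T(n + 1, rng) for _ in range(100_000)]

for t in (2.0, 4.0, 6.0):
    # Compare against the density at the midpoint of the bin.
    print(t, empirical_density(samples, t, 0.1), gamma_density(t + 0.05, n))
```

The two printed columns should agree to within sampling noise.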
We can also look at Poisson point processes (aka Poisson random measures, but that Wikipedia page is very poor). This just makes rigorous the idea of randomly choosing unordered sets of points from a sigma-finite measure space $(E,\mathcal{E},\mu)$. Technically, it can be defined as a set of nonnegative integer-valued random variables $\{N(A)\colon A\in\mathcal{E}\}$ counting the number of points chosen from each subset $A$, such that $N(A)$ has the Poisson distribution of rate $\mu(A)$ and $N(A_1),N(A_2),\ldots$ are independent for pairwise disjoint sets $A_1,A_2,\ldots$. By definition, this satisfies
$$
\mathbb{P}(N(A)=n)=\dfrac{\mu(A)^n}{n!}e^{-\mu(A)}.\tag{1}
$$
The points $T_1,T_2,\ldots$ above defining the homogeneous Poisson process also define a Poisson random measure with respect to the Lebesgue measure $(\mathbb{R}_+,{\cal B},\lambda)$, once you forget about the order in which they were defined and just regard them as a random set. Forgetting the order is, I think, the source of the $n!$. If you think about the probability of $T_{n+1}$ being in a small interval $(t,t+\delta t)$ then, to first order in $\delta t$, this is just the same as having $N([0,t])=n$ and $N((t,t+\delta t))=1$, which has probability $\dfrac{t^n}{n!}e^{-t}\delta t$.
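Spelling out that last step, as a sketch using only $(1)$ with $\mu=\lambda$ and the independence of the counts on the disjoint sets $[0,t]$ and $(t,t+\delta t)$:

$$
\mathbb{P}\big(N([0,t])=n\big)\,\mathbb{P}\big(N((t,t+\delta t))=1\big)=\frac{t^n}{n!}e^{-t}\cdot\delta t\,e^{-\delta t}=\frac{t^n}{n!}e^{-t}\,\delta t+O(\delta t^2).
$$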
So, how can we choose points at random so that each small set $\delta A$ has probability $\mu(\delta A)$ of containing a point, and why does $(1)$ pop out? I'm imagining a hopeless darts player randomly throwing darts about and, purely by luck, hitting the board with some of them. Consider throwing a very large number $N\gg1$ of darts, independently, so that each one only has probability $\mu(A)/N$ of hitting the set $A$ and, when it does hit, is distributed according to the probability distribution $\mu/\mu(A)$. This is consistent, at least, if you think about the probability of hitting a subset $B\subseteq A$. The probability of missing with all of them is $(1-\mu(A)/N)^N\approx e^{-\mu(A)}$, with equality in the limit $N\to\infty$. This is a multiplicative function of $A$ due to independence of the numbers hitting disjoint sets. To get the probability of one dart hitting the set, multiply by $\mu(A)$ (one factor of $\mu(A)/N$ for each individual dart, multiplied by $N$ because there are $N$ of them). For $n$ darts, we multiply by $\mu(A)$ $n$ times, for picking $n$ darts to hit, then divide by $n!$ because we have over-counted the subsets of size $n$ by this factor (due to counting all $n!$ ways of ordering them). This gives $(1)$. I think this argument can probably be cleaned up a bit.
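The darts picture is the usual binomial-to-Poisson limit, and it is easy to check numerically. Here is a small sketch comparing the exact probability that $n$ of $N$ darts hit, each with probability $\mu(A)/N$, against $(1)$. The values of `mu` and `N` are arbitrary choices of mine for illustration.

```python
import math

def binomial_pmf(n, N, p):
    """Probability that exactly n of N independent darts hit, each with probability p."""
    return math.comb(N, n) * p**n * (1 - p)**(N - n)

def poisson_pmf(n, mu):
    """Right-hand side of (1): mu^n e^{-mu} / n!."""
    return mu**n * math.exp(-mu) / math.factorial(n)

mu = 2.5       # plays the role of mu(A)
N = 100_000    # number of darts thrown

for n in range(6):
    print(n, binomial_pmf(n, N, mu / N), poisson_pmf(n, mu))
```

The two columns get closer as `N` grows, which is the sense in which the argument above gives $(1)$.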
Getting back to choosing points randomly on the positive reals, this gives a probability of $\dfrac{t^n}{n!}e^{-t}dt$ of picking $n$ points in the interval $[0,t]$ and one in $(t,t+dt)$. If we sort them in order as $T_1\lt T_2\lt\cdots$ then $\mathbb{P}(T_1\gt t)=e^{-t}$, so $T_1$ is exponentially distributed. Conditional on this, $T_2,T_3,\ldots$ are chosen randomly from $[T_1,\infty)$, so we see that the differences $T_{i+1}-T_{i}$ are independent and identically distributed.
Why is $\dfrac{t^n}{n!}e^{-t}$ maximized at $t=n$? I'm not sure why the mode should be a simple property of a distribution. It doesn't even exist except for unimodal distributions. As $T_{n+1}$ is the sum of $n+1$ IID random variables of mean one, the law of large numbers suggests that it should be peaked approximately around $n$. The central limit theorem goes further, and gives $\dfrac{t^n}{n!}e^{-t}\approx\dfrac{1}{\sqrt{2\pi n}}e^{-(t-n)^2/{2n}}$. Stirling's formula is just this evaluated at $t=n$.
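Both claims can be checked numerically. The sketch below locates the maximum of $\dfrac{t^n}{n!}e^{-t}$ on a grid and compares its value at $t=n$ with $\dfrac{1}{\sqrt{2\pi n}}$. The choice $n=50$ and the grid spacing are mine, and the density is computed through logarithms only to avoid overflow.

```python
import math

def density(t, n):
    """t^n e^{-t} / n!, computed via logs; lgamma(n + 1) = log(n!)."""
    return math.exp(n * math.log(t) - t - math.lgamma(n + 1))

def normal_approx(t, n):
    """Gaussian with mean n and variance n, as in the CLT approximation."""
    return math.exp(-(t - n) ** 2 / (2 * n)) / math.sqrt(2 * math.pi * n)

n = 50
grid = [k / 100 for k in range(1, 20001)]  # t = 0.01, 0.02, ..., 200.00
mode = max(grid, key=lambda t: density(t, n))

print(mode)                                 # grid point where the density peaks
print(density(n, n), normal_approx(n, n))   # Stirling: these should be close
```

Analytically, the location of the peak follows from $\frac{d}{dt}\left(n\log t-t\right)=\frac{n}{t}-1$, which vanishes at $t=n$.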
What's this to do with Tate's thesis? I don't know, and I haven't read it (but intend to), but have a vague idea of what it's about. If there is anything to do with it, maybe it is something to do with the fact that we are relating the sums of independent random variables $S_1+\cdots+S_n$ distributed with respect to the Haar measure on the multiplicative group $\mathbb{R}_+$ (edit: oops, that's not true, the multiplicative Haar measure has cumulative distribution given by $\log$, not $\exp$) with randomly chosen sets according to the Haar measure on the additive group $\mathbb{R}$.
As you said in your post, the closest example I can think of where we take a derivative of a function wrt another function is the functional derivative:
$$\frac{\delta J}{\delta f(x)} = \frac{\partial \Phi}{\partial f}-\frac{d}{dx}\frac{\partial \Phi}{\partial f'}$$
Where:
$$J[f] = \int_a^b \Phi\big(x, f(x), f'(x)\big)\, dx$$
This derivative is defined in a vector space of functions $L$ for some given functional $J[f]: L \to \mathbb{R}$ for $f\in L$.
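To make this concrete, here is a small worked case. The choice of integrand is mine, purely for illustration. Take $\Phi = \tfrac12 f'(x)^2$, so that $\dfrac{\partial \Phi}{\partial f} = 0$ and $\dfrac{\partial \Phi}{\partial f'} = f'(x)$. The formula above then gives

$$\frac{\delta J}{\delta f(x)} = 0-\frac{d}{dx}f'(x) = -f''(x).$$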
The result of the functional derivative is still a function of $x$, as can be seen by the several examples on the Wiki page. This is similar to an elementary derivative, where:
$$\frac{d}{dx} x^2 = 2x$$
So, one could argue that even in variational calculus, we are doing something very much like the chain rule in spirit.
Therefore, without actually seeing the context of your notes, I'm afraid it's hard to say conclusively what they meant. However, if we restrict ourselves to elementary calculus, then the definition of the derivative does not support the notion of a derivative wrt some function $g$, qua function.
This is because the values of $g$ do not form an ordered field in the same sense that the values $h \in \mathbb{R}$ do in the definition of an elementary derivative:
$$\lim_{h\to 0} \frac{f(x+h)-f(x)}{h}$$
Even if we allow $g$ to be well-behaved wrt $x$ (i.e., smooth, continuous), the statement $g\to 0$ does not necessarily identify a unique limiting sequence of values for $f(x+g)-f(x)$, and we need that as a key part of our definition of the derivative. For example:
$$\lim_{g\to 0} \frac{f(x+g)-f(x)}{g}$$
for some function $g(x)$ is not well defined, since we have not specified how $g\to 0$. We don't have this problem when we are taking the limit wrt a real number $h\in \mathbb{R}$.
Anyway, that is the closest I can come to making sense of such a statement. It would, of course, be helpful if you had the actual notes, but in the interest of helping you move past your concerns, I think that the author was not considering functional analysis in that statement.
As an aside, mathematics is such a huge field that any such sweeping, absolute statement is bound to have a counterexample somewhere. For example, one can say it "doesn't make sense" to take the $\sqrt{2}$-th derivative of a function, and you'd be correct if you are talking about ordinary calculus, but it would be wrong to say that the concept, unconditionally, has no meaning, because we have fractional calculus.
Best Answer
I can try to provide some intuition for it, confident that more competent fellow Mathstackexchangers will sort your issue out.
The expression "to differentiate with respect to another function" might be misleading. Moreover, I do not believe this is "functional differentiation", which relates to functionals, i.e. maps from functions to scalars, not to functions. What is really being done is differentiating with respect to a transformed variable, very similar to a change of variables for integration.
Let me clarify with an example, $f(x) = x^2$, $f'(x) = 2x$.
The derivative with respect to a transformed variable $w(x)$ is given by the chain rule
$$ \frac{\mathrm{d}f}{\mathrm{d}w} = \frac{\mathrm{d}f}{\mathrm{d}x} \frac{\mathrm{d}x}{\mathrm{d}w}= \frac{f'}{w'}$$
A simple yet revealing case is maybe $w(x) = x^2$, so that $f = w$.
The formula above tells us that the derivative is $f'/w' = 1$. That sounds reasonable: it is like the derivative of $z(x) = x$, $$\frac{\mathrm{d}z}{\mathrm{d}x}=\frac{\mathrm{d}x}{\mathrm{d}x}=1.$$
This can be interpreted by taking the graph of $f(x) = x^2$, and transforming the horizontal axis according to the transformation $w(x)$.
One then gets a straight line of unit slope, in accordance with the value of the derivative given by the formula above, or the derivative of $z$.
One more example, differentiate $f(x) = x^3$ with respect to $w(x)=x^2$. The formula above yields $ \frac{3}{2}x$.
Let us check directly by transforming the $x$-axis to the new variable $x' = x^2$ (taking $x \ge 0$), so that $x^3 = x' \cdot \sqrt{x'}$. Now we differentiate with respect to $x'$ and obtain $$ \frac{\mathrm{d}f}{\mathrm{d}x'} = \frac{3}{2} \sqrt{x'},$$ and once we transform back from $x'$ to $x$ we confirm the result given by the formula.
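The same check can be done numerically, by dividing a small change in $f$ by the corresponding small change in $w$. This is only a finite-difference sketch. The helper name `derivative_wrt`, the step `h` and the test point $x=2$ are my own choices.

```python
def derivative_wrt(f, w, x, h=1e-6):
    """Approximate df/dw at x as (change in f) / (change in w) over [x - h, x + h]."""
    return (f(x + h) - f(x - h)) / (w(x + h) - w(x - h))

def cube(x):
    return x**3

def square(x):
    return x**2

x = 2.0
# d(x^3)/d(x^2) should match the chain-rule value (3/2) x
print(derivative_wrt(cube, square, x), 1.5 * x)
# d(x^2)/d(x^2) should be 1
print(derivative_wrt(square, square, x))
```

Note that the ratio breaks down wherever $w'(x)=0$ (here at $x=0$), which is exactly where the "stretching" of the axis stops being invertible.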
Hope it gives some hints. One can perfectly think about it in terms of slopes, once the correct "stretching" of the $x$-axis is made.