Which model to use depends on the objective of the analysis.
If the objective is to develop a classifier to predict binary outcomes, then (as you can see) these three models are all approximately the same and give you approximately the same classifier. In that case the choice is largely moot: you don't care which model produces your classifier, and you can use cross-validation or split-sample validation to determine which model performs best in similar data.
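For instance, here is a minimal split-sample sketch in Stata; the auto dataset and the 70/30 split are just placeholders for your own data and validation scheme, and the holdout Brier score (mean squared error of the predicted probabilities) is one of several reasonable scoring rules:
sysuse auto, clear
set seed 12345
gen byte train = runiform() < 0.7     /* 70/30 training/holdout split */
logit foreign mpg weight if train
predict double p_logit, pr            /* predicted probabilities, all observations */
probit foreign mpg weight if train
predict double p_probit, pr
gen double se_logit  = (foreign - p_logit)^2  if !train
gen double se_probit = (foreign - p_probit)^2 if !train
summarize se_logit se_probit          /* means are the holdout Brier scores */
Whichever model gives the smaller holdout score predicts better on data like yours; in practice the logit and probit scores will usually be nearly identical.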
For inference, by contrast, the models estimate different parameters. All three regression models are special cases of GLMs, which use a link function and a variance structure to relate a binary outcome to (in this case) a continuous predictor. NLS and logistic regression use the same link function (the logit), but NLS minimizes squared error when fitting the S-curve, whereas logistic regression maximizes the likelihood of the data under the assumptions that the log-odds of the outcome probability is linear in the predictors and that the observed outcomes are binomially distributed. I can't think of a reason why we'd consider NLS to be useful for inference.
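To make that distinction concrete, write $\Lambda(u)=1/(1+e^{-u})$ for the logistic curve; the two estimators then solve different problems:
$$\hat\beta_{NLS}=\arg\min_\beta \sum_i \left(y_i-\Lambda(x_i'\beta)\right)^2, \qquad \hat\beta_{MLE}=\arg\max_\beta \sum_i \left[\,y_i\log \Lambda(x_i'\beta)+(1-y_i)\log\left(1-\Lambda(x_i'\beta)\right)\right].$$
Both trace out the same S-curve, but they weight deviations from it differently, so the coefficient estimates and their standard errors generally differ.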
Probit regression uses a different link function, the inverse of the cumulative normal distribution function (so the fitted probabilities follow the normal CDF). The normal tails "taper" faster than the logit's, and the probit is often used for inference on binary data that arises as a threshold crossing of an unobserved, normally distributed continuous outcome.
Empirically, the logistic regression model is used far more often for the analysis of binary data: its exponentiated coefficients are odds ratios, which are easy to interpret; it is a maximum likelihood technique; and it has good convergence properties.
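To see where the easy interpretation comes from, rewrite the logistic model on the odds scale:
$$\frac{p}{1-p}=e^{\alpha+\beta x}\quad\Longrightarrow\quad\frac{p(x+1)/\left[1-p(x+1)\right]}{p(x)/\left[1-p(x)\right]}=e^{\beta},$$
so $e^{\beta}$ is the odds ratio for a one-unit increase in $x$, whatever the starting value of $x$.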
I am inclined to agree with @StasK's comment. However, something like what you want is feasible, though a little tricky to interpret. What I propose below tells you how the marginal effect of $x$ varies with $x$.
You know that the conditional mean of the dependent variable in a probit model is
$$\mathbb{Pr}[y=1 \vert x,z]=\Phi(\alpha + \beta \cdot x + \gamma \cdot x^2+z'\pi).$$
The variable $x$ is what we care about. The vector $z$ contains some other covariates. $\Phi(.)$ is the standard normal cdf, and $\varphi(.)$ is the standard normal pdf, which will be used below.
The marginal effect of $x$ is
$$\frac{\partial \mathbb{Pr}[y=1 \vert x,z]}{\partial x}=\varphi(\alpha + \beta \cdot x + \gamma \cdot x^2+z'\pi)\cdot(\beta + 2\cdot\gamma \cdot x).$$
The change in the marginal effect is the second derivative
$$
\frac{\partial^2 \mathbb{Pr}[y=1 \vert x,z]}{\partial x^2} =
\varphi(\alpha + \beta \cdot x + \gamma \cdot x^2+z'\pi)\cdot(2\cdot\gamma) +(\beta + 2\cdot\gamma \cdot x)^2\cdot\varphi'(\alpha + \beta \cdot x + \gamma \cdot x^2+z'\pi).
$$
Since $\varphi'(x)=-x \cdot \varphi(x)$ (differentiate the normal density $\varphi(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ to verify this), this "simplifies" to
$$
\frac{\partial^2 \mathbb{Pr}[y=1 \vert x,z]}{\partial x^2} =
\varphi(\alpha + \beta \cdot x + \gamma \cdot x^2+z'\pi)\cdot \left[ 2\cdot\gamma -(\beta + 2\cdot\gamma \cdot x)^2\cdot(\alpha + \beta \cdot x + \gamma \cdot x^2+z'\pi)\right].
$$
Note that this is a function of $x$ and the $z$s, so we can evaluate this quantity at various possible values. Also note that while the first factor is surely positive, since it is a normal density, the bracketed term is hard to sign even if you know the signs and magnitudes of the coefficients.
Assuming that I didn't screw up the derivative, here's how I might actually do this in Stata:
#delimit;
sysuse auto, clear;
probit foreign c.mpg##c.mpg c.weight, coefl;
/* At own values of covariates */
margins, expression(normalden(predict(xb))*(2*_b[c.mpg#c.mpg] - predict(xb)*(_b[c.mpg]+2*_b[c.mpg#c.mpg]*mpg)^2));
/* At chosen values of covariates */
margins, expression(normalden(predict(xb))*(2*_b[c.mpg#c.mpg] - predict(xb)*(_b[c.mpg]+2*_b[c.mpg#c.mpg]*mpg)^2)) at(mpg=20 weight=3000);
/* At average values of covariates */
margins, expression(normalden(predict(xb))*(2*_b[c.mpg#c.mpg] - predict(xb)*(_b[c.mpg]+2*_b[c.mpg#c.mpg]*mpg)^2)) atmeans;
If I were doing this myself and feeling lazy, I might use reverse adjacent contrasts instead. For comparison, here's the analytic second derivative evaluated at 4 values of $x$:
margins, expression(normalden(predict(xb))*(2*_b[c.mpg#c.mpg] - predict(xb)*(_b[c.mpg]+2*_b[c.mpg#c.mpg]*mpg)^2)) at(mpg = (10 20 30 40));
Here's the contrast approximation to that derivative: it compares the marginal effect of mpg at mpg of $x+1$ to the marginal effect at mpg of $x$:
margins, dydx(mpg) at(mpg = (10 11 20 21 30 31 40 41)) contrast(atcontrast(ar(2(2)8)._at) wald);
Note how close the two commands' outputs are, and how much easier the second one is to write.
I don't know what r is in your code, so I can't verify whether what you have is correct.
Best Answer
I think a better way to see the marginal effect of a given variable, say $X_j$, is to produce a scatter plot with the predicted probability on the vertical axis and $X_j$ on the horizontal axis. This is the most "layman" way I can think of to indicate how influential a given variable is. No maths, just pictures. If you have a lot of data points, then a boxplot or a scatterplot smoother may help to show where most of the data is (as opposed to just a cloud of points).
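For concreteness, here is a minimal Stata sketch of that picture, reusing the auto data from the answer above (substitute your own model and covariate):
sysuse auto, clear
logit foreign mpg weight
predict double phat, pr    /* predicted probabilities */
scatter phat mpg           /* predicted probability against the covariate */
lowess phat mpg            /* scatter plus a lowess smoother, useful for a dense cloud */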
Not sure how "layman" the next section is, but you may find it useful.
If we look at the marginal effect, call it $m_j$, noting that $g(p)=\sum_kX_k\beta_k$, we get
$$m_j=\frac{\partial p}{\partial X_j}=\frac{\beta_j}{g'\left[g^{-1}(X^T\beta)\right]}=\frac{\beta_j}{g'(p)}$$
So the marginal effect depends on the estimated probability and the gradient of the link function, in addition to the beta. The division by $g'(p)$ comes from the chain rule for differentiation, together with the fact that $\frac{\partial g^{-1}(z)}{\partial z}=\frac{1}{g'\left[g^{-1}(z)\right]}$, which can be shown by differentiating both sides of the obviously true equation $z=g\left[g^{-1}(z)\right]$. We also have $g^{-1}(X^T\beta)=p$ by definition. For a logit model, we have $g(p)=\log(p)-\log(1-p)\implies g'(p)=\frac{1}{p}+\frac{1}{1-p}=\frac{1}{p(1-p)}$, and the marginal effect is:
$$m_j^{logit}=\beta_jp(1-p)$$
What does this mean? Well, $p(1-p)$ is zero at $p=0$ and at $p=1$, and it reaches its maximum value of $0.25$ at $p=0.5$. So the marginal effect is greatest when the probability is near $0.5$ and smallest when $p$ is near $0$ or near $1$. However, $p(1-p)$ still depends on $X_j$, so the marginal effects are complicated: because the effect depends on $p$, you will get a different marginal effect for different values of $X_k,\;k\neq j$. That is possibly one good reason to just do that simple scatter plot: you don't need to choose which values of the covariates to use.
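As a sanity check on $m_j^{logit}=\beta_jp(1-p)$, here is a minimal Stata sketch (auto data again) comparing the hand-computed effect with what margins reports:
sysuse auto, clear
logit foreign mpg
predict double p, pr
gen double mfx = _b[mpg]*p*(1 - p)   /* hand-computed marginal effect, per observation */
summarize mfx                        /* its mean is the average marginal effect */
margins, dydx(mpg)                   /* should match the mean above */
(For the marginal effect at the means of the covariates, compare against margins, dydx(mpg) atmeans instead.)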
For a probit model, we have $g(p)=\Phi^{-1}(p)\implies g'(p)=\frac{1}{\phi\left[\Phi^{-1}(p)\right]}$, where $\Phi(.)$ is the standard normal CDF and $\phi(.)$ is the standard normal pdf. So we get:
$$m_j^{probit}=\beta_j\phi\left[\Phi^{-1}(p)\right]$$
Note that this has most of the properties of the $m_j^{logit}$ marginal effect I discussed earlier, and the same is true of any (sane) link function which is symmetric about $0.5$, e.g. $g(p)=\tan\left(\frac{\pi}{2}[2p-1]\right)$. The dependence on $p$ is more complicated, but it still has the general "hump" shape (highest at $p=0.5$, lowest near $0$ and $1$). The link function changes the maximum height (e.g. the probit maximum is $\frac{1}{\sqrt{2\pi}}\approx 0.4$, the logit's is $0.25$) and how quickly the marginal effect tapers towards zero.
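For the record, those quoted maxima come from evaluating each marginal effect at $p=0.5$ with $\beta_j=1$:
$$m_j^{logit}\Big|_{p=0.5}=0.5\times 0.5=0.25,\qquad m_j^{probit}\Big|_{p=0.5}=\phi\left[\Phi^{-1}(0.5)\right]=\phi(0)=\frac{1}{\sqrt{2\pi}}\approx 0.399.$$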