I think a better way to see the marginal effect of a given variable, say $X_j$, is to produce a scatter plot with the predicted probability on the vertical axis and $X_j$ on the horizontal axis. This is the most "layman" way I can think of to show how influential a given variable is: no maths, just pictures. If you have a lot of data points, a boxplot or scatterplot smoother may help to show where most of the data is (as opposed to just a cloud of points).
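For concreteness, here is a minimal sketch of such a plot using simulated data and statsmodels; all variable names, coefficients, and the lowess smoother are illustrative choices, not taken from your data:

```python
# A minimal sketch with simulated data; names and values are illustrative.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))                      # two covariates
eta = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]        # linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))      # binary outcome

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
p_hat = fit.predict(sm.add_constant(X))          # predicted probabilities

# Scatter of predicted probability against X_1, with a lowess smoother
# to show where most of the data lies.
smooth = sm.nonparametric.lowess(p_hat, X[:, 0])
plt.scatter(X[:, 0], p_hat, alpha=0.3)
plt.plot(smooth[:, 0], smooth[:, 1], color="red")
plt.xlabel("$X_1$")
plt.ylabel("predicted probability")
plt.show()
```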
Not sure how "Layman" the next section is, but you may find it useful.
If we look at the marginal effect, call it $m_j$, noting that $g(p)=\sum_kX_k\beta_k$, we get
$$m_j=\frac{\partial p}{\partial X_j}=\frac{\beta_j}{g'\left[g^{-1}(X^T\beta)\right]}=\frac{\beta_j}{g'(p)}$$
So the marginal effect depends on the estimated probability and the gradient of the link function, in addition to the beta. The division by $g'(p)$ comes from the chain rule for differentiation and the fact that $\frac{\partial g^{-1}(z)}{\partial z}=\frac{1}{g'\left[g^{-1}(z)\right]}$, which can be shown by differentiating both sides of the obviously true equation $z=g\left[g^{-1}(z)\right]$. We also have $g^{-1}(X^T\beta)=p$ by definition. For a logit model, $g(p)=\log(p)-\log(1-p)\implies g'(p)=\frac{1}{p}+\frac{1}{1-p}=\frac{1}{p(1-p)}$, and the marginal effect is:
$$m_j^{logit}=\beta_jp(1-p)$$
What does this mean? Well, $p(1-p)$ is zero at $p=0$ and at $p=1$, and it reaches its maximum value of $0.25$ at $p=0.5$. So the marginal effect is greatest when the probability is near $0.5$, and smallest when $p$ is near $0$ or near $1$. However, $p(1-p)$ still depends on $X_j$ (through $p$), so the marginal effects are complicated. In fact, because it depends on $p$, you will get a different marginal effect for different values of $X_k,\;k\neq j$. That is possibly one good reason to just do the simple scatter plot: you don't need to choose which values of the covariates to use.
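As a quick sanity check on the formula, the sketch below (with made-up coefficients) compares $\beta_j p(1-p)$ to a finite-difference slope of the predicted probability:

```python
# Numerical check of m_j = beta_j * p * (1 - p) against a finite
# difference; the coefficient values here are illustrative only.
import numpy as np
from scipy.special import expit   # inverse logit, g^{-1}

beta = np.array([0.5, 1.5, -1.0])          # [intercept, beta_1, beta_2]
x = np.array([1.0, 0.3, -0.2])             # one observation, leading 1 = intercept

p = expit(x @ beta)
analytic = beta[1] * p * (1 - p)           # marginal effect of X_1 at this point

h = 1e-6
x_h = x.copy()
x_h[1] += h                                # nudge X_1 by h
numeric = (expit(x_h @ beta) - p) / h      # finite-difference slope

print(f"analytic {analytic:.6f}  numeric {numeric:.6f}")  # these agree
```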
For a probit model, we have $g(p)=\Phi^{-1}(p)\implies g'(p)=\frac{1}{\phi\left[\Phi^{-1}(p)\right]}$, where $\Phi(\cdot)$ is the standard normal CDF and $\phi(\cdot)$ is the standard normal pdf. So we get:
$$m_j^{probit}=\beta_j\phi\left[\Phi^{-1}(p)\right]$$
Note that this has most of the properties of the $m_j^{logit}$ marginal effect I discussed earlier, and the same is true of any link function which is symmetric about $0.5$ (and sane, of course, e.g. the cauchit link $g(p)=\tan\left(\frac{\pi}{2}[2p-1]\right)$). The dependence on $p$ is more complicated, but it still has the general "hump" shape (highest at $p=0.5$, lowest at $p=0$ and $p=1$). The link function changes the maximum height (e.g. the probit maximum is $\frac{1}{\sqrt{2\pi}}\approx 0.4$, the logit maximum is $0.25$) and how quickly the marginal effect tapers towards zero.
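If you want to see the comparison, a short sketch (illustrative only) plots the multiplier $1/g'(p)$ for the logit, probit, and that tan-based (cauchit) link:

```python
# Plot the marginal-effect multiplier 1/g'(p) for the logit, probit,
# and cauchit links; all three peak at p = 0.5 with different heights.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, cauchy

p = np.linspace(0.001, 0.999, 500)
logit_mult = p * (1 - p)                  # peaks at 0.25
probit_mult = norm.pdf(norm.ppf(p))       # peaks at 1/sqrt(2*pi) ~ 0.40
cauchit_mult = cauchy.pdf(cauchy.ppf(p))  # tan-based link from the text

for mult, label in [(logit_mult, "logit"), (probit_mult, "probit"),
                    (cauchit_mult, "cauchit")]:
    plt.plot(p, mult, label=label)
plt.xlabel("p")
plt.ylabel("marginal-effect multiplier $1/g'(p)$")
plt.legend()
plt.show()
```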
The question of what model to use has to do with the objective of the analysis.
If the objective is to develop a classifier to predict binary outcomes, then (as you can see) these three models are all approximately the same and give you approximately the same classifier. That makes the choice largely a moot point, since you don't care which model develops your classifier, and you might use cross validation or split sample validation to determine which model performs best on similar data.
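A sketch of what such a split-sample comparison could look like, using simulated data and statsmodels (all names and values illustrative):

```python
# Split-sample validation comparing logit and probit classifiers;
# with simulated data the two test accuracies are nearly identical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ [0.2, 1.0, -0.7])))

train, test = slice(0, 1000), slice(1000, None)
for name, model in [("logit", sm.Logit), ("probit", sm.Probit)]:
    fit = model(y[train], X[train]).fit(disp=0)
    acc = np.mean((fit.predict(X[test]) > 0.5) == y[test])
    print(f"{name}: test accuracy {acc:.3f}")
```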
For inference, the models estimate different parameters. All three regression models are special cases of GLMs, which use a link function and a variance structure to determine the relationship between a binary outcome and (in this case) a continuous predictor. The NLS and the logistic regression model use the same link function (the logit), but NLS minimizes squared error in fitting the S curve, whereas logistic regression is a maximum likelihood estimate under the assumption of a linear model for the log-odds and a binomial distribution for the observed outcomes. I can't think of a reason why we'd consider NLS useful for inference.
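To make the NLS-versus-MLE distinction concrete, here is a small sketch on simulated data, using scipy's `curve_fit` for the least-squares S curve and statsmodels for the maximum likelihood fit; the two criteria give close but not identical estimates:

```python
# Fit the same S curve two ways: least squares on the sigmoid (NLS)
# and maximum likelihood (logistic regression).
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import expit
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=400)
y = rng.binomial(1, expit(0.5 + 1.2 * x)).astype(float)

nls_params, _ = curve_fit(lambda x, b0, b1: expit(b0 + b1 * x), x, y)
mle_fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

print("NLS:", nls_params)        # minimizes squared error
print("MLE:", mle_fit.params)    # maximizes the binomial likelihood
```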
Probit regression uses a different link function, the inverse of the cumulative normal distribution function. This "tapers" faster than a logit and is often used to make inference on binary data that is observed as a binary threshold of unobserved, continuous, normally distributed outcomes.
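The latent-variable interpretation is easy to illustrate with simulated data (a sketch; the threshold at zero and the coefficients are arbitrary choices):

```python
# Latent-variable view: y = 1 when an unobserved normal outcome crosses
# a threshold; probit regression recovers the latent coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(1000, 1)))
y_star = X @ [0.3, 0.8] + rng.normal(size=1000)   # latent continuous outcome
y = (y_star > 0).astype(int)                      # observed binary threshold

print(sm.Probit(y, X).fit(disp=0).params)         # close to [0.3, 0.8]
```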
Empirically, the logistic regression model is used far more often for analysis of binary data, since its coefficients are easy to interpret (each exponentiated coefficient is an odds ratio), it is a maximum likelihood technique, and it has good convergence properties.
Best Answer
It is hard to answer your question, as you do not say what background you have. Here is a general answer.
You want to find $\mathbb{P}[Y_i = 1 \mid X_i]$ when you have a bunch of data that comes from the same distribution, that is, the result ($Y_i$) is influenced the same way by the inputs ($X_i$) in each sample you have. Read more: I.I.D. (Wikipedia).
You know $X_i$ and $Y_i$ for your samples, and you want to find a function that maps the inputs to the outputs as closely as possible. This is what the probit/logit models do, and they have a parameter vector $\beta$ that dictates the influence of the different input variables. To find the best $\beta$, we first have to define what "best" means.
In linear regression models, the best model is often defined as the one with the highest $R^2$, or the one with the smallest Mean Squared Error (Wikipedia). In classification tasks, a common measure is the logistic loss (Wikipedia).
The value of the loss function is $L(y, f(x,\beta))$, where $L$ is the loss function, $x,y$ are the input and output variables, and $\beta$ are the parameters of your model. It compares your model's predictions with the real result $y$ and gives you a metric to evaluate your model. You want the lowest error with respect to $\beta$: $$\hat{\beta} = \arg \min_{\beta} L(y,f(x,\beta))$$ Depending on your model and loss function, you can set the derivative to $0$ and solve for the $\beta$ that achieves it based on the data. In the case of the transformed models you are interested in, this is not possible: as mentioned in the comments, no closed-form solution exists. However, you can come very close to the optimal solution by using gradient descent (Wikipedia). The idea is that the derivative of the loss at a given $\beta$ points in the direction of steepest increase of the loss with respect to $\beta$, so if you step in the opposite direction, your $\beta$ improves and the error decreases. As long as your loss function is convex, you will find a global minimum, the optimal $\beta$.
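Here is a bare-bones sketch of that idea for the logistic loss; the step size and iteration count are arbitrary illustrative choices:

```python
# Gradient descent on the mean logistic loss, as described above.
# The gradient of the mean log loss w.r.t. beta is X^T (p - y) / n.
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = rng.binomial(1, expit(X @ [0.5, 1.5, -1.0]))

beta = np.zeros(3)
lr = 0.1                                          # step size
for _ in range(2000):
    grad = X.T @ (expit(X @ beta) - y) / len(y)   # gradient of the loss
    beta -= lr * grad                             # step against the gradient
print(beta)                                       # close to [0.5, 1.5, -1.0]
```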
How to find the parameters of a logistic regression is covered in the introduction of many machine learning books and classes. For a more specific answer, I'd advise looking at:
You can also take a look at questions on a similar topic on CrossValidated: