The comments in the code seem to end up defining the two essentially identically (aside from a relatively small difference in the constant).
Both are of the form $cAn^{-1/5}$, both with what looks like the same $A$ (estimate of scale), and $c$'s very close to 1 (close relative to the typical uncertainty in the estimate of the optimum bandwidth).
[The binwidth estimate that more usually seems to be associated with Scott is the one from his 1979 paper[1] ($3.49 s n^{-1/3}$) -- e.g. see Wikipedia (scroll down a little) or R's nclass.scott.]
The 1.059 in what the code calls the "Scott estimate" is from the (earlier) book by Silverman (see p45 of the Silverman reference at your link; Scott's derivation of it is on p130-131 of the book they refer to). It comes from a normal-theory estimate.
The optimum bandwidth (in integrated mean square error terms) is a function of the integrated squared second derivative, and $1.059\sigma$ comes out of that calculation for a normal, but for many other distributions that's a good deal wider than optimal.
The $A$ term is an estimate of $\sigma$ (sort of a robustified estimate, in a way that reduces the tendency for it to be too large if there are outliers/skewness/heavy tails). See eq 3.30 on p47, justified on p46-7.
For similar reasons to those I suggested before, Silverman goes on to suggest reducing 1.059 (in fact he actually uses 1.06 throughout, not 1.059 -- as does Scott in his book). He chooses a reduced value that loses no more than 10% efficiency on IMSE at the normal, which is where the 0.9 comes from.
So both those binwidths are based on the IMSE-optimal binwidth at the normal: one right at the optimum, the other about 15% smaller, chosen to get within 90% of the efficiency of the optimum at the normal. [I'd call both of them "Silverman" estimates; I have no idea why they name the 1.059 one for Scott.]
In my opinion, both are far too large. I don't use histograms to get IMSE-optimal estimates of the density; if obtaining density estimates that are optimal in the IMSE sense were what I wanted, I wouldn't use histograms to do it.
Histograms should be erring on the noisier side (let the eye do the necessary smoothing). I nearly always double (or more) the default number of bins these kinds of rules give. So I wouldn't use 1.06 or 0.9, I'd tend to use something around 0.5, maybe less at really large sample sizes.
There's really very little to choose between them, since they both give far too few bins to be much use at finding what's going on in the data (on which, at least at small sample sizes, see here).
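For concreteness, here's a minimal sketch (my own illustration, not the code being discussed) of the two normal-reference rules, with the constant exposed so it can be dropped to something like 0.5; I'm assuming the usual Silverman form $A=\min(s,\ \text{IQR}/1.34)$ for the scale estimate.

```python
import numpy as np

def normal_reference_bandwidth(x, c=0.9):
    """Rule-of-thumb bandwidth c * A * n^(-1/5).

    c = 1.059 is the IMSE-optimal constant at the normal (what the code
    calls the "Scott estimate"); c = 0.9 is Silverman's reduced value;
    something around 0.5 gives the noisier estimate I'd tend to use.
    A = min(sample sd, IQR / 1.34) is the robustified scale estimate.
    """
    x = np.asarray(x, dtype=float)
    s = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    a = min(s, iqr / 1.34)
    return c * a * x.size ** (-1 / 5)

# Compare the constants on a skewed sample
rng = np.random.default_rng(0)
x = rng.lognormal(size=500)
for c in (1.059, 0.9, 0.5):
    print(c, normal_reference_bandwidth(x, c))
```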
[1]: Scott, D.W. (1979), "On optimal and data-based histograms," Biometrika, 66, 605-610.
The mathematical operation here also allows the use of a vector instead of a scalar. Think of it as a weighted sum of vectors:
$$\frac{\sum_i w_i \mathbf y_i}{\sum_k w_k} = \sum_i \left(\frac{w_i}{\sum_k w_k} \right) \mathbf y_i = \sum_i \tilde w_i \mathbf y_i$$
where the weights are given in terms of the kernel function:
$$
w_i = K_h(\mathbf x-\mathbf x_i)\\[1em]
\Rightarrow \quad\tilde w_i = \frac{K_h(\mathbf x-\mathbf x_i)}{\sum_{k=1}^{n}K_h(\mathbf x-\mathbf x_k)}$$
With this, multivariate Nadaraya-Watson kernel regression simply boils down to a one-dimensional regression in each component of $\mathbf y$, all sharing the same weights $\tilde w_i$, as in the sketch below.
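A rough sketch of this (the Gaussian kernel, the function name, and the shared scalar bandwidth are my own assumptions, just for illustration):

```python
import numpy as np

def nadaraya_watson(X, Y, x0, h):
    """Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian kernel.

    X  : (n, d) array of covariates
    Y  : (n, p) array of (possibly vector-valued) responses
    x0 : (d,) query point
    h  : scalar bandwidth
    """
    X, Y, x0 = np.asarray(X, float), np.asarray(Y, float), np.asarray(x0, float)
    sq_dist = ((X - x0) ** 2).sum(axis=1)      # ||x0 - x_i||^2
    w = np.exp(-0.5 * sq_dist / h**2)          # unnormalized weights w_i
    w_tilde = w / w.sum()                      # normalized weights, sum to 1
    # The same weights apply to every component of y: one 1-D regression per column
    return w_tilde @ Y

# Example with a two-dimensional response
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.column_stack([np.sin(3 * X[:, 0]), np.cos(3 * X[:, 0])])
Y += 0.1 * rng.normal(size=Y.shape)
print(nadaraya_watson(X, Y, x0=[0.5], h=0.2))
```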
Best Answer
I would recommend reading this beautiful article by Racine and Li, published in the Journal of Econometrics in 2004. They develop a framework for estimating regression functions nonparametrically using kernel methods with mixed types of covariates (categorical or continuous regressors). Among other results, they show consistency of the cross-validated estimates. It is a classic article in the nonparametric econometrics literature.
The main method for choosing bandwidth parameters is, undoubtedly, the cross-validation procedure. However, other methods exist, such as bootstrapping (a quick Google search gives, for example, a PhD thesis about choosing bandwidths for np kernel regression).
If you have a large sample, leave-one-out CV may not be the way to go for computational reasons. Moreover, if the data consist of time series, the CV method may not be valid anymore. What you can do is proceed with hold-out validation: assume you have $T$ observations, fit the estimator on the first portion, and choose the bandwidth that minimizes the prediction error on the remaining held-out observations, as sketched below.
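A minimal sketch of that hold-out scheme (the Nadaraya-Watson estimator, the Gaussian kernel, and the 80/20 split are my own choices for illustration):

```python
import numpy as np

def nw_predict(x_train, y_train, x_eval, h):
    """Nadaraya-Watson predictions at x_eval with a Gaussian kernel."""
    d = x_eval[:, None] - x_train[None, :]            # pairwise differences
    w = np.exp(-0.5 * (d / h) ** 2)
    return (w @ y_train) / w.sum(axis=1)

def holdout_bandwidth(x, y, grid, train_frac=0.8):
    """Pick h from `grid` by MSE on the last (1 - train_frac) of the sample.

    Splitting by time order (no shuffling) keeps the scheme sensible for
    time-series data, as discussed above.
    """
    cut = int(train_frac * len(x))
    x_tr, y_tr, x_val, y_val = x[:cut], y[:cut], x[cut:], y[cut:]
    errs = [np.mean((nw_predict(x_tr, y_tr, x_val, h) - y_val) ** 2) for h in grid]
    return grid[int(np.argmin(errs))]

# Example
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 300)
y = np.sin(x) + 0.3 * rng.normal(size=x.size)
print(holdout_bandwidth(x, y, grid=np.linspace(0.05, 1.0, 20)))
```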
Also, while the Nadaraya-Watson estimator is indeed an np kernel estimator, that is not the case for Lowess, which is a local polynomial regression method. You could also fit your regression function using sieves (i.e., through a basis expansion of the function), based on wavelets for example, given the structure of your data.
Finally, np kernel estimation of densities is extremely similar to that of the conditional mean, which is what you have in mind when you talk about 'regression'. An np kernel regression estimates $E(Y|X)$, where $Y$ is the dependent variable and $X$ is a (hopefully) exogenous predictor. Replacing $Y$ by $I(Y\leq y)$ - where $I$ denotes the indicator function that equals $1$ when the event inside the brackets occurs - gives the conditional mean $E(I(Y\leq y)|X) = P(Y\leq y|X)$. Now run a bunch of np kernel regressions of $I(Y\leq y)$ on $X$ for various values of $y$. This provides an estimate of the conditional cumulative distribution of $Y$ given $X$. Now take a derivative with respect to $y$ and you have a density. So what's the difference after all? Just a matter of choosing the dependent variable. The method is the same, unless perhaps you want to re-scale and impose restrictions on the CDF...
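To make that concrete, here is a rough sketch (Gaussian kernel, fixed bandwidth, and no monotonicity or rescaling adjustments - all my own choices for illustration) of estimating the conditional CDF by regressing $I(Y\leq y)$ on $X$ over a grid of $y$ values, then differencing to get a density:

```python
import numpy as np

def nw_conditional_cdf(x, y, x0, y_grid, h):
    """Estimate P(Y <= t | X = x0) for each t in y_grid.

    Each value is a Nadaraya-Watson regression of the indicator
    I(Y <= t) on X, evaluated at x0.
    """
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    w = w / w.sum()
    return np.array([w @ (y <= t).astype(float) for t in y_grid])

# Example: conditional CDF and density of Y given X = 1
rng = np.random.default_rng(3)
x = rng.uniform(0, 2, 1000)
y = 2 * x + rng.normal(size=x.size)
y_grid = np.linspace(-2, 6, 50)
cdf = nw_conditional_cdf(x, y, x0=1.0, y_grid=y_grid, h=0.2)
density = np.gradient(cdf, y_grid)   # derivative w.r.t. y gives a density estimate
```

Note the raw estimate need not be monotone or reach exactly 0 and 1, which is where the re-scaling and shape restrictions mentioned above would come in.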