Solved – LOWESS implementation in Stata vs. R (and Python)

Tags: nonparametric, r, smoothing, stata, statsmodels

Recently I was comparing the output of LOWESS regressions performed in R (and in Python's statsmodels module) with Stata's. I realized that some of the values produced by Stata seem to be off; specifically, the estimates in the tails appear to be incorrect.

I dove into the source code of R's lowess() function (which seems to be based on Cleveland's original Fortran code found here) and the legacy Stata code for the ksm program in the Stata 7 ado update file (found here). Note that post-v7 versions of Stata implement the lowess command via the _LOWESS C routine, which is not exposed to the user. I verified that running Stata 7's ksm command with the optional lowess argument in Stata 14 generates the same ("incorrect") output as running lowess directly. That is, ksm Y X, lowess is equivalent to lowess Y X.

After inspecting the respective source codes, I think the problem with Stata's implementation lies in the fact that the subset size used in each local regression depends on the ordinal position of the X values. That is, for extreme and near-extreme values of X (in other words, values close to the tails), Stata uses a smaller subset than for more central X's. Intuitively, the problem can be illustrated with a simple example of 100 data points where the bandwidth parameter is chosen to be 0.4, so that each subset has size 0.4*100 = 40. In R, the sizes of the subsets used for estimating $Y_1$, $Y_{10}$, $Y_{20}$, $Y_{30}$ and $Y_{40}$ would look like this:

[Figure: subset sizes used by R]

On the other hand, Stata seems to reduce the subset size for all X's at (roughly) ordinal positions 1 to 20 and 80 to 100:

[Figure: subset sizes used by Stata]

When I rewrote the ksm program so that the subset size was held fixed for all X's, I got the same results as in R (and Python). Below is pseudocode for both implementations:

# bwidth = bandwidth
# count = N
# i = index of local regression considered

# Cleveland's Fortran implementation
size = max(min(floor(bwidth*count), count), 2) #size of neighborhood
lower_bound = 1
upper_bound = size
for each i
    x = X[i]

    # slide the window right while its left edge is farther from x
    # than the next point past its right edge
    while upper_bound < count
        diff_left = x - X[lower_bound]
        diff_right = X[upper_bound + 1] - x
        if diff_left <= diff_right, stop
        increment lower_bound and upper_bound by 1

    h = max(x - X[lower_bound], X[upper_bound] - x)
    h9 = 0.999*h and h1 = 0.001*h
    r = abs(X - x)
    W = (1 - (abs(X - x)/h)^3)^3 if r > h1 and r <= h9; W = 1 if r <= h1; W = 0 otherwise
    run WLS of Y on X using weights W if within lower_bound/upper_bound, get Y_hat[i]

# Stata implementation
k = int((bwidth*count - 0.5)/2) #half-bandwidth
for each i
    x = X[i]

    lower_bound = max(1, i - k)
    upper_bound = min(i + k, count)

    h = 1.0001*max(X[upper_bound] - x, x - X[lower_bound])
    W = (1 - (abs(X - x)/h)^3)^3 if h != 0, W = 1 otherwise
    run WLS of Y on X using weights W if within lower_bound/upper_bound, get Y_hat[i]
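The difference in the two windowing rules is easy to see in code. Below is a sketch (my own translation of the pseudocode above, not the actual R or Stata source) that computes the window bounds under both schemes for the 100-point, bwidth = 0.4 example and prints the resulting subset sizes:

```python
import math

def cleveland_window(x, i, frac):
    # Cleveland/R rule: a fixed-size window of max(min(floor(frac*n), n), 2)
    # points, slid right while the left edge is farther from x[i] than the
    # next point past the right edge. Returns 0-based (lower, upper) bounds.
    n = len(x)
    size = max(min(math.floor(frac * n), n), 2)
    lo, hi = 0, size - 1
    while hi < n - 1 and x[i] - x[lo] > x[hi + 1] - x[i]:
        lo += 1
        hi += 1
    return lo, hi

def stata_window(x, i, frac):
    # ksm/Stata rule: half-bandwidth k measured in ordinal positions and
    # clipped at the sample edges, so the window shrinks near the tails.
    n = len(x)
    k = int((frac * n - 0.5) / 2)
    return max(0, i - k), min(i + k, n - 1)

x = list(range(100))                       # 100 equally spaced X's
for i in (0, 9, 19, 29, 39, 49):           # X_1, X_10, ..., X_50 (0-based i)
    cl, ch = cleveland_window(x, i, 0.4)
    sl, sh = stata_window(x, i, 0.4)
    print(f"i={i + 1:3d}  R/Fortran size={ch - cl + 1}  Stata size={sh - sl + 1}")
```

With these inputs, the Cleveland rule keeps every subset at 40 points, while the Stata rule gives only 20 points at $X_1$ and grows toward its full 2k+1 = 39 points around $X_{20}$, which matches the pictures above.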

I understand that implementations sometimes differ across software tools (or even across packages made for the same software). That being said, I would expect to find at least some online references discussing such a non-standard approach. I tried to find as many other implementations of LOWESS as I could, and they all seem to be based on the Fortran code mentioned above and/or at least follow the same approach. I was not able to find any references discussing the specific approach used by Stata.

I would like to know whether such an implementation of LOWESS is correct, whether it's some "proprietary", quasi-correct method, or simply a bug. If it is a legitimate method, I would also appreciate it if someone could point me to a (preferably academic) reference. Thanks!

EDIT I: I wonder if this is because, historically, Stata's lowess command was based on the ksm program, which (with its default settings) was not meant to estimate the LOWESS regression. It appears that from Stata 8 onward, the decision was made to abandon ksm and instead implement lowess, whose default was chosen to be the LOWESS regression (while ksm's default options became lowess's non-defaults). The treatment of the tails, however, remained the same as in ksm.

EDIT II: Note that all of these comparisons relied on matching all other parameters of the LOWESS regressions. Since Stata doesn't allow for multiple (robustness) iterations and also doesn't implement the interpolation shortcut that cuts down the number of local regressions required (as R and the original Fortran code do), I made sure R's command used both iter=0 and delta=0. Specifically, I was comparing lowess(X, Y, f=0.4, delta=0, iter=0) (R) with lowess Y X, bwidth(0.4) (Stata) and statsmodels.api.nonparametric.lowess(Y, X, frac=0.4, it=0, delta=0) (Python). Both X and Y were well-behaved.

UPDATE: OK, it seems I wasn't the first to notice this. I just found an old user-written adjksm command, which "is identical to ksm except that the bandwidth of the smoother is constant along the x-axis." See here. I also confirmed that adjksm Y X, bwidth(0.4) lowess generates the same output as R's lowess() with iter=0 and delta=0, to the 4th decimal place.

Best Answer

An interesting investigation into details I had never paid attention to.

It's not a bug, because the behavior is documented this way in the Stata documentation.

I never looked into the details and references for boundary problems in kernel regression, but it looks like Stata uses a different bias-variance trade-off at the boundary than R or statsmodels do. My guess is that R and statsmodels oversmooth compared to Stata if there are fluctuations or trends close to the boundaries. With a constant rank-based half-bandwidth, the effective window always shrinks as you approach the boundary, which is exactly the Stata behavior.

So maybe it's a partial misnomer in Stata, if there is a strict definition of LOWESS, but Stata's behavior is still common for windowed convolution filters and similar local polynomial kernel regressions with fixed bandwidth.

In this case, "fixed bandwidth" is in terms of the number of neighbors rather than metric distance, and Stata does not trade off neighbors from one side to the other by extending the one-sided window near the boundary.

For references, I would look at Fan and Gijbels and the related literature on boundary problems of kernel and local polynomial regression.