Solved – LOWESS implementation in Stata vs. R (and Python)

Tags: nonparametric, r, smoothing, stata, statsmodels

Recently I was comparing the output of LOWESS regressions performed in R (and in Python's statsmodels module) with Stata's. I realized that some of the values produced by Stata seem to be off; specifically, the estimates in the tails appear to be incorrect.

I dove into the source code of R's lowess() function (which seems to be based on Cleveland's original Fortran code found here) and the legacy Stata code for the ksm program in the Stata 7 ado update file (found here). Note that post-v7 versions of Stata implement the lowess command via the _LOWESS C routine, which is not exposed to the user. I verified that running Stata 7's ksm command with the optional lowess argument in Stata 14 generates the same ("incorrect") output as running lowess directly. That is, ksm Y X, lowess is equivalent to lowess Y X.

After inspecting the respective source codes, I think the problem with Stata's implementation lies in the fact that the subset size used in each local regression depends on the ordinal position of the X values. That is, for extreme and near-extreme values of X (in other words, values close to the tails), Stata uses a smaller subset than for more central X's. Intuitively, the problem can be illustrated with a simple example of 100 data points where the bandwidth parameter is chosen to be 0.4, so that each subset has size 0.4*100 = 40. In R, the sizes of the subsets used for estimating $Y_1$, $Y_{10}$, $Y_{20}$, $Y_{30}$ and $Y_{40}$ would look like this:

[Figure: subset sizes used by R]

On the other hand, Stata seems to reduce the subset size for all X's at (roughly) ordinal positions 1 to 20 and 80 to 100:

[Figure: subset sizes used by Stata]

When I rewrote the ksm program so that the subset size was held fixed for all X's, I got the same results as in R (and Python). Below is pseudocode for both implementations:

# bwidth = bandwidth
# count = N
# i = index of local regression considered

# Cleveland's Fortran implementation
size = max(min(floor(bwidth*count), count), 2) #size of neighborhood
lower_bound = 1
upper_bound = size
for each i
    x = X[i]

    # slide the window right while its left edge is farther from x
    # than the next point past its right edge
    while upper_bound < count
        diff_left = x - X[lower_bound]
        diff_right = X[upper_bound + 1] - x
        if diff_left <= diff_right, stop
        increment lower_bound and upper_bound by 1

    h = max(x - X[lower_bound], X[upper_bound] - x)
    h9 = 0.999*h and h1 = 0.001*h
    r = abs(X - x)
    W = (1 - (abs(X - x)/h)^3)^3 if r > h1 and r <= h9; W = 1 if r <= h1; W = 0 otherwise
    run WLS of Y on X using weights W if within lower_bound/upper_bound, get Y_hat[i]

# Stata implementation
k = int((bwidth*count - 0.5)/2) #half-bandwidth
for each i
    x = X[i]

    lower_bound = max(1, i - k)
    upper_bound = min(i + k, count)

    h = 1.0001*max(X[upper_bound] - x, x - X[lower_bound])
    W = (1 - (abs(X - x)/h)^3)^3 if h != 0, W = 1 otherwise
    run WLS of Y on X using weights W if within lower_bound/upper_bound, get Y_hat[i]
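The difference in the two windowing rules is easy to see in code. Below is a sketch (my own translation of the pseudocode above, not the actual R or Stata source) that computes the window bounds under both schemes for the 100-point, bwidth = 0.4 example and prints the resulting subset sizes:

```python
import math

def cleveland_window(x, i, frac):
    # Cleveland/R rule: a fixed-size window of max(min(floor(frac*n), n), 2)
    # points, slid right while the left edge is farther from x[i] than the
    # next point past the right edge. Returns 0-based (lower, upper) bounds.
    n = len(x)
    size = max(min(math.floor(frac * n), n), 2)
    lo, hi = 0, size - 1
    while hi < n - 1 and x[i] - x[lo] > x[hi + 1] - x[i]:
        lo += 1
        hi += 1
    return lo, hi

def stata_window(x, i, frac):
    # ksm/Stata rule: half-bandwidth k measured in ordinal positions and
    # clipped at the sample edges, so the window shrinks near the tails.
    n = len(x)
    k = int((frac * n - 0.5) / 2)
    return max(0, i - k), min(i + k, n - 1)

x = list(range(100))                       # 100 equally spaced X's
for i in (0, 9, 19, 29, 39, 49):           # X_1, X_10, ..., X_50 (0-based i)
    cl, ch = cleveland_window(x, i, 0.4)
    sl, sh = stata_window(x, i, 0.4)
    print(f"i={i + 1:3d}  R/Fortran size={ch - cl + 1}  Stata size={sh - sl + 1}")
```

With these inputs, the Cleveland rule keeps every subset at 40 points, while the Stata rule gives only 20 points at $X_1$ and grows toward its full 2k+1 = 39 points around $X_{20}$, which matches the pictures above.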

I understand that implementations sometimes differ across software tools (or even across packages made for the same software). That being said, I would expect to find at least some online references discussing such a non-standard approach. I tried to find as many other implementations of LOWESS as I could, and they all seem to be based on the Fortran code mentioned above and/or at least follow the same approach. I was not able to find any references discussing the specific approach used by Stata.

I would like to know whether such an implementation of LOWESS is correct, whether it's some "proprietary", quasi-correct method, or simply a bug. If it is a legitimate method, I would also appreciate it if someone could point me to a (preferably academic) reference. Thanks!

EDIT I: I wonder if this is because, historically, Stata's lowess command was based on the ksm program, which (with its default settings) was not meant to estimate the LOWESS regression. It appears that from Stata 8 onward, the decision was made to abandon ksm and instead implement lowess, whose default was chosen to be the LOWESS regression (while ksm's default options became lowess's non-defaults). The treatment of the tails, however, remained the same as in ksm.

EDIT II: Note that all of these comparisons relied on matching all other parameters of the LOWESS regressions. Since Stata doesn't allow for multiple (robustness) iterations and also doesn't implement the interpolation shortcut that cuts down the number of local regressions required (as R and the original Fortran code do), I made sure R's command used both iter=0 and delta=0. Specifically, I was comparing lowess(X, Y, f=0.4, delta=0, iter=0) (R) with lowess Y X, bwidth(0.4) (Stata) and statsmodels.api.nonparametric.lowess(Y, X, frac=0.4, it=0, delta=0) (Python). Both X and Y were well-behaved.

UPDATE: OK, it seems I wasn't the first to notice this. I just found an old user-written adjksm command, which "is identical to ksm except that the bandwidth of the smoother is constant along the x-axis." See here. I also confirmed that adjksm Y X, bwidth(0.4) lowess generates the same output as R's lowess() with iter=0 and delta=0, to the 4th decimal place.

Best Answer

An interesting investigation into details I had never paid attention to.

It's not a bug, because the behavior is documented this way in the Stata documentation.

I never looked into the details and references for boundary problems in kernel regression, but it looks like Stata uses a different bias-variance trade-off at the boundary than R or statsmodels do. My guess is that R and statsmodels oversmooth compared to Stata if there are fluctuations or trends close to the boundaries. With a constant rank-based half-bandwidth, the effective window always shrinks as you approach the boundary, which is exactly the Stata behavior.

So maybe it's a partial misnomer in Stata, if there is a strict definition of LOWESS, but Stata's behavior is still common for windowed convolution filters and similar local polynomial kernel regressions with fixed bandwidth.

In this case, "fixed bandwidth" is in terms of the number of neighbors rather than metric distance, and Stata does not trade off neighbors from one side to the other by extending the one-sided window near the boundary.

For references, I would look at Fan and Gijbels and the related literature on boundary problems of kernel and local polynomial regression.