Loess Plot Interpretation – How to Explain and Understand Loess Regression in R

Tags: interpretation, loess, r, regression

I used the following R code:

library(ggpubr)  # ggscatter() is from the ggpubr package

ggscatter(DF_w, x = "Age", y = "Value",
          add = "loess", conf.int = TRUE)

to get this plot:
[Figure: scatterplot of Value against Age with a loess curve and confidence band]

However, I don't understand how I can report this in a scientific paper or presentation. I understand the graph suggests a non-linear relationship between X (Age) and Y (Value), with considerable variability until late life, when Value declines as Age increases. Is that all there is to it? Is there a way to show that this relationship is significant using loess, or do I need to carry out a regular regression/correlation to determine significance?

Any information on interpreting such plots is appreciated.

Best Answer

Bootstrapping or a permutation test (as suggested by Stephan Kolassa) will help you assess the significance of the apparent (but complex) association in the plot.

You need to adopt a reasonable measure of what the Loess fit has accomplished. One such measure is the mean squared residual. (Many others are possible, but discussing them is not pertinent here.) Let's call this the "loss."
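In symbols, writing $\hat y_i$ for the Loess fit's value at the $i$th observation, this loss is

$$\text{loss} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat y_i\right)^2,$$

which is exactly what the stat function defined below computes.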

You also need to formulate a specific null hypothesis. The simplest is that the data exhibit no trend: that is, they vary randomly and independently around a common value. Under mean squared error loss, the best estimate of this common value is the arithmetic mean. With a sufficient amount of data (often ten or more observations suffice), the differences between the response ("Value" in the plot) and this mean will be only slightly (negatively) correlated and can stand in as surrogates for the true random errors.

The permutation distribution in this setting is the distribution of losses obtained by permuting (reordering) the null-model residuals in all possible ways and conducting a Loess fit on each permuted data set. Under the null hypothesis, all these permutations are equally likely.

The permutation test compares the actual loss to the permutation distribution of losses. As a practical matter, the latter is estimated from a sample of random permutations (with $n$ observations there are $n!$ possible permutations, far too many to generate them all).


Here, to illustrate, are data generated with no inherent trend along with a permutation distribution estimated from 500 draws. The vertical red line shows the loss for these data: it is close to the middle. Its p-value is computed as usual for a two-sided test: the red line splits the histogram left and right into two areas and the p-value is twice the smaller area. Very small p-values are called "significant" and taken as evidence of some kind of trend. The shape of the Loess plot (shown at left) helps you interpret just what that trend might be.

[Figure 1: trend-free data with their Loess fit (left) and the simulated null permutation distribution of losses, with the observed loss marked by a vertical red line near the middle (right)]
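In symbols: if $t$ is the observed loss and $T$ is a loss drawn from the permutation distribution, this two-sided p-value is

$$p = 2\,\min\left\{\Pr(T \le t),\ \Pr(T \ge t)\right\},$$

capped at $1$; the code at the end of this answer estimates it from the simulated draws.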

The large, anodyne p-value is consistent with the trend-free method of generating these data.

For data closer to those in the question, the result is different:

[Figure 2: data resembling those in the question with their Loess fit (left) and the permutation distribution of losses, with the observed loss far in the lower tail (right)]

The actual mean squared error of $0.000735$ is inconsistent with the squared errors typical of the permutation distribution: this is a significant trend. The data plot at the left suggests the trend is principally a decrease in mean values from $0.31$ at age 60 to $0.25$ at ages 80 and over.

BTW, it is no surprise that the actual statistics in the two cases have approximately the same values: both estimate the error variance, which was $0.025^2 = 0.000625$ in both simulations. The curvilinear trend in the second instance, though, causes the simple fit (under the null hypothesis) to be poorer, thereby shifting the permutation distribution to higher values, as you can see by comparing the two figures.


The R code needed is simple, clear, and efficient. fit performs the Loess fit while stat uses that to compute the mean squared error.

# Note: lowess() returns fitted values in order of increasing x,
# so stat() assumes x is sorted (otherwise the fits and y are misaligned).
fit <- function(y, x, ...) lowess(x, y, ...)
stat <- function(y, x) mean((y - fit(y, x)$y)^2) # Mean squared error loss

Given a data frame object X with Value and Age columns to store the response and explanatory variables, respectively, the permutation distribution is estimated by computing predicted values and residuals under the null hypothesis and then iteratively permuting the residuals (with the sample function) and recomputing the loss.

  predicted <- mean(X$Value)
  residuals <- X$Value - predicted
  dsample <- replicate(5e3, stat(predicted + sample(residuals), X$Age))

In this case, after about one second of computation, dsample winds up with 5e3 ($5000$) values randomly drawn from the permutation distribution. The figures are then created by applying hist to dsample to show these values.

  # Compute the p-value
  actual <- with(X, stat(Value, Age))
  stats <- c(actual, dsample)
  p <- mean(stats <= actual)
  p <- 2 * min(1/2, p, 1-p)
  # Display the results
  hist(dsample, freq=FALSE, xlim=range(stats),
       col=gray(.95),
       sub=paste("p-value is approximately", signif(p, 2)),
       main="Simulated Null Permutation Distribution",
       xlab = "Mean Squared Difference")
  abline(v = actual, lwd=2, col="Red") # The statistic for the data
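To try this pipeline end to end, one can generate trend-free data resembling the first figure. In this sketch the sample size, age range, and mean level are assumptions; only the error standard deviation of $0.025$ comes from the discussion above.

  # Hypothetical trend-free data (n, age range, and mean level are assumed;
  # the error SD of 0.025 matches the value quoted above).
  set.seed(17)
  n <- 60
  X <- data.frame(Age = sort(runif(n, 50, 90)))
  X$Value <- 0.28 + rnorm(n, sd = 0.025)  # constant mean: no trend

Running the code above on this X should reproduce the qualitative behavior of the first figure: a loss near the middle of the permutation distribution and a large p-value.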

One caveat: I had to tune this Loess fit by specifying a relatively short width for its search radius. This frequently is the case. An honest permutation test must implement this fine-tuning in some automatic way and apply it to each permutation. Otherwise, the original (hand-tuned) fit will be too good and the resulting p-value will be too small, perhaps far too small. People often use some sort of cross-validation for such automatic tuning, as sketched below.
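Here is a minimal sketch of what such automatic tuning might look like. It is not the tuning used above: it selects the span of stats::loess (rather than the search radius of lowess) by k-fold cross-validation, because predict() on a loess model can evaluate the fit at held-out x values. The function names, the span grid, and k = 5 are all illustrative choices.

cv_span <- function(y, x, spans = seq(0.3, 1, by = 0.1), k = 5) {
  d <- data.frame(x = x, y = y)
  folds <- sample(rep(seq_len(k), length.out = nrow(d)))  # random fold labels
  cv_mse <- sapply(spans, function(s) {
    mean(sapply(seq_len(k), function(i) {
      m <- loess(y ~ x, data = d[folds != i, ], span = s, degree = 1,
                 control = loess.control(surface = "direct"))  # allow extrapolation
      pred <- predict(m, newdata = d[folds == i, ])
      mean((d$y[folds == i] - pred)^2, na.rm = TRUE)  # held-out MSE for this fold
    }))
  })
  spans[which.min(cv_mse)]  # span with the smallest cross-validated loss
}

# An honest statistic re-tunes the span for every (permuted) data set:
stat_tuned <- function(y, x) {
  m <- loess(y ~ x, span = cv_span(y, x), degree = 1)
  mean(residuals(m)^2)
}

Replacing stat with stat_tuned in the replicate() call then applies the same tuning to every permutation, at a substantially higher computational cost (one may want fewer replications, say 500 instead of 5e3).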