R – Difference Between Minimizing RMSE or MSE in Non-Linear Least Squares?


I am working with R with this code from the book "Bootstrap Methods: With Applications in R" by Gerhard Dikta and Marsel Scheer:

library(dplyr)

set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion")
semiparametric_data <-
  data.frame(X = runif(400, min = 1, max = 30)) %>%
  dplyr::mutate(
    mu = 4 * exp(-X / 2) - 3 * exp(-X / 10),
    epsilon = rnorm(400, sd = 0.25),
    Y = mu + epsilon)

fit_sp <- minpack.lm::nlsLM(
  formula = Y ~ a * exp(X / b) + c * exp(X / d),
  data = semiparametric_data,
  start = c(a = 4, b = -2, c = -3, d = -10),
  control = nls.control(maxiter = 1000))
fit_sp

## Nonlinear regression model
##   model: Y ~ a * exp(X/b) + c * exp(X/d)
##    data: semiparametric_data
##      a      b      c      d
##  3.707 -2.105 -3.025 -9.797
##  residual sum-of-squares: 23.76
##
## Number of iterations to convergence: 3
## Achieved convergence tolerance: 1.49e-08
1. What does the nls function minimize: RMSE or MSE?
2. What is the difference between minimizing RMSE and minimizing MSE?

From a theoretical/mathematical point of view the resulting coefficients should be the same, but in practice they come out slightly different.

3. Which is more efficient to minimize in non-linear least squares: RMSE or MSE?

Best Answer

There should be very little difference in the results between minimizing any reasonable monotonic, sign-preserving transformation of the sum of squares (if the sign were flipped we would need to maximize rather than minimize): the sum of squared residuals (SSQ), the square root of the SSQ, the mean square (SSQ/n), the root mean square, and so on. As you point out, mathematically/statistically there should be no difference. Computationally, the only differences are in floating-point accuracy. I can imagine that RMSE could be slightly more accurate in a case where the scale of the response, or the range of scales in the response, was extremely large, but I would be surprised if the difference were noticeable in any practical context.
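To make the equivalence explicit: for any strictly increasing transformation $g$,

$$\arg\min_\theta\, g\big(\mathrm{SSQ}(\theta)\big) \;=\; \arg\min_\theta\, \mathrm{SSQ}(\theta),$$

since $g(u) < g(v)$ exactly when $u < v$. Taking $g(u) = u/n$ gives the MSE and $g(u) = \sqrt{u/n}$ gives the RMSE, so all three objectives share the same minimizer.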

nls minimizes the sum of squared residuals. If you wanted to play with this, you could try different objective functions in one of R's general-purpose minimizers (optim, nlm, nlminb, ...). But the differences in efficiency between the various minimizers used by optim/nlm/nlminb, the more specialized ones used by nls (Gauss-Newton) and nlsLM (Levenberg-Marquardt), and even the particular choices made when implementing these algorithms, are likely to matter more than the choice of scale of the objective function.
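If you want to see this empirically, here is a minimal sketch of that experiment using optim() on the data from the question: fit the same four-parameter model once by minimizing SSQ and once by minimizing RMSE, then compare the estimates. The helper names (model_mean, ssq_obj, rmse_obj) are mine for illustration, not from any package, and the sketch assumes semiparametric_data from the question is already in the workspace.

## Mean function of the model: a*exp(X/b) + c*exp(X/d)
model_mean <- function(par, X) {
  par["a"] * exp(X / par["b"]) + par["c"] * exp(X / par["d"])
}

## Two objective functions on the same residuals: sum of squares vs. RMSE
ssq_obj <- function(par, data) {
  sum((data$Y - model_mean(par, data$X))^2)
}
rmse_obj <- function(par, data) {
  sqrt(mean((data$Y - model_mean(par, data$X))^2))
}

start <- c(a = 4, b = -2, c = -3, d = -10)
fit_ssq  <- optim(start, ssq_obj,  data = semiparametric_data, method = "BFGS")
fit_rmse <- optim(start, rmse_obj, data = semiparametric_data, method = "BFGS")

## The two parameter vectors should agree up to optimizer tolerance; any
## remaining discrepancy reflects floating point and the stopping rule,
## not a different underlying minimum.
round(fit_ssq$par - fit_rmse$par, 4)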