Solved – How many significant figures should I report for a regression equation

precisionregressionreportinguncertainty

I have a regression equation that I am trying to include as part of a study predicting a variable of interest using a linear proxy using a simple ordinary least squares function. I did the regression analyses using lm in R and the results it gave me had an extremely high reported precision, five decimal places for the variables and four decimal places. Some of my colleagues that I showed the study to said that the high number of decimal places made them raise their eyebrows, especially given that the linear variable that I measured was only measured to the nearest 0.01 mm.

I was wondering if there was any way to determine how many significant figures I should report for the slope, intercept, and r2 value, or if I should report all values that R reports to me to the furthest digit. Part of the reason I am a bit unsure if I should round or not is the data are biological data and hence both are log-transformed. As a result, I am concerned that reducing the precision of the variables will result in a significant less accurate regression equation, since it would have more of an effect on the regression than one calculated on non-log-transformed values.

Best Answer

My eyebrows go up when numbers are reported with far too much precision, too, but there's more going on in a regression setting than we might expect. So much, so, that I won't venture a thorough answer; but I think I can share some useful principles and examples.

Principle 1: Meaningfulness

Rounding $R^2$ to the nearest whole number (thereby producing only $0$ or $1$), for instance, is not of much use. Often two decimal digits suffice, as in $R^2 = 0.69.$ But sometimes more--even much more--precision is called for. The issue is that $1-R^2$ is arguably the more meaningful value. When the response variable is a vector $(y_i),$ the variance of those numbers times $1-R^2$ is the residual variance, so the square root gives us information about the typical sizes of the residuals in your fit. Your readers will likely want some sense of how closely your fit reflects the data and this value is a good statistic for that.

Thus, when you report $R^2=0.99$ (which often occurs in scientific studies, engineering studies, where errors are positively correlated, or when explanatory variables are varied over wide ranges), all one can say about $1-R^2$ is that it lies somewhere between $1-0.985 = 0.015$ and $1-0.995 = 0.005.$ That range covers a factor of three--half an order of magnitude, or just 1/2 significant digits of precision.

Moral: Report $R^2$ with enough precision that $1-R^2$ is also precisely known.

Principle 2: Reproducibility

Others might want to reproduce your calculations. Present your statistics with enough precision that they can be confident little errors are not sneaking into their (or your) computations. For instance, if you have $n=1000$ data then one possible source of discrepancy could be the use of $1/(n-1)$ instead of $1/n$ in computing variances. That would be a relative error of about $1/n,$ or just one part in a thousand. In such a case you ought to consider reporting all results with at least three decimal places of precision (plus perhaps one or two more).

Moral: Use more precision in reporting statistics associated with large datasets.

Principle 3: Applicability

Regression coefficients are plugged into a model to produce predicted (fitted) values and from them, via subtraction, residuals and other quantities. In some models these quantities of interest are hugely smaller in magnitude than the coefficients or the values of variables used to compute them.

One striking example is regression of quantities against a date variable. When a recent date is represented (as many often are) in seconds elapsed since Jan 1, 1970, internally it is a quantity in the billions. The intercept (which often is meaningless) may be huge, requiring many more significant digits for accurate calculation.

Here, for instance, is a synthetic dataset of eight daily temperature readings at a weather station in January 2020: about as plain and simple a situation as one can conceive of.

Figure

The least-squares estimates look innocuous--but notice how large the intercept is compared to the slope:

              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.358e+04  1.640e+03  -8.280 0.000168 ***
Date         8.607e-06  1.039e-06   8.282 0.000168 ***

I computed the predicted value for the first day by rounding both coefficients to 1, 2, ..., 8 significant figures (sf) and finally by using the full double-precision values (about 15.6 sf), producing these possible values for the first day:

  1 sf 2 sf   3 sf 4 sf  5 sf   6 sf   7 sf   8 sf     DP
  4200 -430 -14.10 1.18 0.396 0.0591 0.0523 0.0531 0.0533

Notice how ridiculous the values are even when using three sf and how they don't produce even the first digit of the correct value until 6 sf are used (two more than the default output shown above!). Even when using 8 sf, the third digit of the value is still incorrect.

Moral: Beware the potential for linear combinations of numbers to produce results of much smaller magnitude. Report all values with enough significant figures to reproduce the linear combinations.

Principle 4: Watch out for transformations

The questions mentions using logarithms of data. Logarithms and other "strong" non-linear transformations can change the precision in varying ways. For example, when you round a positive result to two decimal places, then the value $0.001$ (whose logarithm is $-3$) is rounded to $0,$ so all your readers will know is the original value was between $0$ and $0.01$ and its logarithm is anywhere between $-\infty$ and $-2.$ That's truly poor precision! In some situations this won't matter, but in others it can make your results useless.

Moral: When your readers might need to transform your results, report your results with enough precision so that the results of those transformations are adequately precise.

Summary

A common thread to these principles is the need to be aware of how people might use the values you report. The choice of significant figures in reporting regression results is especially subtle because the values have such versatile uses, ranging from reproducing your calculations to forecasts, extrapolation, post hoc testing, and more. It might be prudent, then, to use several more significant figures of precision in your regression summaries than you might otherwise be inclined to.

Related Question