The first thing to note is that the estimators in the linear regression model are not particularly sensitive to heavy tails in the error distribution (so long as the error variance is finite). Fitting a standard linear regression to data with excessively heavy tails means that data points in the tails are penalised excessively, but the coefficient estimators are still usually quite reasonable. The main drawback in this situation is that prediction intervals will be too short, since they do not account for the heavy tails.
If you would like to adapt your model to deal with the heavier tails, you can use the heavyLm function in the heavy package in R. This function fits a linear model using the T-distribution as the error distribution, which allows you to use an error distribution with heavier tails than the normal. The only drawback of the package is that it requires you to specify the degrees-of-freedom parameter for the error distribution, rather than estimating it from the data. However, with some creative looping, you could even estimate this parameter if you wanted to. In any case, this model should allow you to get estimates for a linear regression where the error distribution has heavier tails than the normal distribution, and so your corresponding residual density plot and residual QQ plot should be close to the stipulated error distribution.
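The "creative looping" idea can be sketched in base R without the heavy package: maximise the t log-likelihood of the regression directly with optim over a grid of degrees-of-freedom values, and keep the df with the best profile likelihood. This is an illustrative sketch, not the internals of heavyLm; the simulated data, the df grid, and the log-scale parameterisation are all assumptions made for the example.

```r
# Sketch: linear regression with t-distributed errors, profiling over df.
# (Illustrative base-R version of what heavyLm does for a fixed df.)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rt(n, df = 5)            # true model with t(5) errors

# Negative log-likelihood for y = b0 + b1*x + s*T(df) errors
negll <- function(par, df, x, y) {
  b0 <- par[1]; b1 <- par[2]; s <- exp(par[3])   # s > 0 via log-scale
  -sum(dt((y - b0 - b1 * x) / s, df = df, log = TRUE) - log(s))
}

dfs  <- c(2, 3, 5, 10, 30)                # candidate degrees of freedom
fits <- lapply(dfs, function(df)
  optim(c(0, 0, 0), negll, df = df, x = x, y = y))
best <- which.min(sapply(fits, `[[`, "value"))
dfs[best]                                 # df with best profile likelihood
fits[[best]]$par[1:2]                     # intercept and slope estimates
```

The same loop extends to a finer df grid, or to optimising over df jointly with the other parameters.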
Cauchy. The reason for the strange histogram from Cauchy data is precisely that
you are getting many extreme values in the tails--too sparse and too extreme to show
well on your histogram. A data summary or boxplot might be more useful to visualize
what's going on.
set.seed(999)
x = rcauchy(10000)
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5649.323 -0.970 0.021 -0.037 1.005 2944.847
x.trnk = x[abs(x) < 200] # omit a few extreme values for hist
length(x.trnk)
[1] 9971
par(mfrow=c(2,1))
dcauchy(0)
[1] 0.3183099 # Height needed for density plot in histogram
hist(x.trnk, prob=T, br=100, ylim=c(0,.32), col="skyblue2")
curve(dcauchy(x), add=T, col="red", n=10001)
boxplot(x.trnk, horizontal=T, col="skyblue2", pch=20)
par(mfrow=c(1,1))
The standard Cauchy distribution (no parameters specified) is the same as Student's t
distribution with DF = 1. The density function integrates to $1,$ as appropriate, but
its tails are so heavy that the integral for its 'mean' diverges, so its mean doesn't
exist. One speaks of its median as the center of the distribution.
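The nonexistent mean shows up directly in simulation (a small base-R check, not part of the answer above, with sample sizes chosen for illustration): sample means of Cauchy draws never settle down as you repeat the experiment, while sample medians do.

```r
# Repeated samples of 1000 Cauchy draws: the means are wildly unstable
# (a mean of Cauchy variables is itself Cauchy), the medians are not.
set.seed(999)
means   <- replicate(1000, mean(rcauchy(1000)))
medians <- replicate(1000, median(rcauchy(1000)))
c(sd(means), sd(medians))   # spread of the means is vastly larger
```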
Student's t, DF = 10. There is nothing particularly unusual about Student's t distribution with DF = 10.
Its tails are somewhat heavier than for standard normal, but not so much heavier
that it's hard to make a useful histogram (no truncation needed). And its mean is $\mu=0.$
y = rt(10000, 10)
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.988219 -0.698855 -0.006711 -0.005902 0.685740 6.481538
dt(0,10)
[1] 0.3891084
par(mfrow=c(2,1))
hist(y, prob=T, br=30, ylim=c(0,.4), col="skyblue2")
curve(dt(x,10), add=T, col="red", n=10001)
boxplot(y, horizontal=T, col="skyblue2", pch=20)
par(mfrow=c(1,1))
The distribution $\mathsf{T}(10)$ is sufficiently heavy-tailed that
samples from it as large as $n=10\,000$ tend to show many boxplot outliers---as seen above. In a simulation of $100\,000$ samples of size $10\,000,$
almost every sample showed at least one outlier and the average number
of outliers per sample was more than 180. [This simulation runs slowly
because each sample of $10\,000$ needs to be sorted in order to determine
its outliers.]
set.seed(2020)
nr.out = replicate(10^5, length(boxplot.stats(rt(10000,10))$out))
mean(nr.out)
[1] 188.5043
mean(nr.out>0)
[1] 1
Best Answer
You could use iterative feasible generalized least squares.
Start by setting the weight for each data point to 1 (i.e. no weighting), and use the following algorithm:
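A minimal sketch of one common version of that iteration: fit a weighted regression, model the error variance from the residuals, update the weights to the inverse of the fitted variance, and repeat until the weights stabilise. The variance model used here (log squared residuals regressed on x) and the simulated data are illustrative assumptions, not the only choice.

```r
# Iterative feasible GLS: alternate between fitting the mean model with
# current weights and re-estimating the error variance from residuals.
set.seed(42)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # error variance grows with x

w <- rep(1, n)                            # start with no weighting
for (i in 1:20) {
  fit   <- lm(y ~ x, weights = w)
  e2    <- residuals(fit)^2
  vfit  <- lm(log(e2) ~ x)                # model the (log) error variance
  w_new <- 1 / exp(fitted(vfit))          # weights = inverse fitted variance
  if (max(abs(w_new - w)) < 1e-8) break   # stop when weights stabilise
  w <- w_new
}
coef(fit)                                 # estimates of the mean model
```

With heteroskedasticity of this form, the weighted fit recovers the coefficients more efficiently than plain OLS would.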