Solved – Parallel straight lines on residual vs fitted plot

multiple regressionrregression

I have a multiple regression problem, which I tried to solve using simple multiple regression:

model1 <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data=data)

This seems to be explaining the 85% of variance (according to R-squared) which seems pretty good.

However what worries me is the weird looking Residuals vs Fitted plot, see below:

enter image description here

I suspect the reason why we have such parallel lines is because the Y value has only 10 unique values corresponding to about 160 of X values.

Perhaps I should use a different type of regression in this case?

Edit: I've seen in the following paper a similar behavior. Note it's a one-page only paper so when you preview it you can read it all. I think it explains pretty well why I observe this behavior but I'm still not sure if any other regression would work better here?

Edit2: The closest example to our case I can think of is the change in interest rates. FED announces new interest rate every few months (we don't know when and how often). In the meantime we gather our independent variables on the daily basis (such as daily inflation rate, stock market data, etc.). As a result we will have a situation where we can have many measurements for one interest rate.

Best Answer

One possible model it one of a "rounded" or "censored" variable : let $y_1,\ldots y_{10}$ being your 10 observed values. One could suppose that there is a latent variable $Z$ representing the "real" price, which you do not fully know. However, you can write $Y_i=y_j\Rightarrow{}y_{j-1}\leq{}Z_i\leq{}y_{j+1}$ (with $y_0=-\infty, y_{11}=+\infty$, if you forgive this abuse of notation). If you are willing to risk a statement about the distribution of Z in each of these intervals, a Bayesian regression becomes trivial ; a maximum likelihood estimation needs a bit more work (but not much, as far as I can tell). Analogues of this problem are treated by Gelman & Hill (2007).