Suppose that the relationship $Y_i = \beta_1 + \beta_2 X_i + u_i$ is being fitted and that the value of $X$ is missing for some observations. One way of handling the missing-values problem is to drop those observations. Another is to set $X = 0$ for the missing observations and include a dummy variable $D$ defined to be equal to 1 if $X$ is missing, 0 otherwise. Demonstrate that the two methods must yield the same estimates of $\beta_1$ and $\beta_2$. Write down an expression for $RSS$ using the second approach, decompose it into the $RSS$ for observations with $X$ present and the $RSS$ for observations with $X$ missing, and determine how the resulting expression is related to $RSS$ when the missing-value observations are dropped.
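For concreteness, the variables used by the second approach can be built like this (a sketch of my own, not from the textbook; the sample values are made up, with `np.nan` marking a missing $X$):

```python
import numpy as np

# Hypothetical sample in which X is missing (NaN) for some observations
x = np.array([1.5, 2.0, np.nan, 3.1, np.nan])

d = np.isnan(x).astype(float)        # dummy D: 1 where X is missing, 0 otherwise
x0 = np.where(np.isnan(x), 0.0, x)   # X with missing values set to 0

print(d)   # 0/1 indicator of missingness
print(x0)  # original x with NaNs replaced by 0
```

The regression in the second approach is then run on the intercept, `x0`, and `d`.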
My attempt: Suppose we have $n$ observations in total, the first $k$ of which have $x_i$ present and the remaining $n-k$ of which have $x_i$ missing. Model 1 drops the missing observations; model 2 sets $x_i = 0$ for them and includes the dummy $d_i$, equal to 1 if $x_i$ is missing and 0 otherwise, with coefficient $\beta_3$:
$$RSS_1 = \sum_{i=1}^{k}(y_i - \beta_1 - \beta_2 x_i)^2,$$
$$RSS_2 = \sum_{i=1}^{k}(y_i - \beta_1 - \beta_2 x_i)^2 + \sum_{i=k+1}^{n}(y_i - \beta_1 - \beta_3)^2.$$
We can see that the first term of $RSS_2$ is exactly $RSS_1$. The textbook from which I took this exercise says that the $\beta_1$ and $\beta_2$ minimizing the two expressions will be the same; namely, they are the $\beta_1$ and $\beta_2$ obtained by minimizing the first term of $RSS_2$ alone, but I have no idea why. Is this because the second term of $RSS_2$ does not depend on $x_i$?
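A quick numerical check of the textbook's claim (my own sketch, using numpy's least-squares routine on made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 8                          # first k observations have x present
x = rng.normal(size=k)
y = rng.normal(size=n)                # any y works for the algebraic claim

# Method 1: drop the observations with x missing
X_drop = np.column_stack([np.ones(k), x])
beta_drop, *_ = np.linalg.lstsq(X_drop, y[:k], rcond=None)

# Method 2: set x = 0 where missing and add the dummy d
x_full = np.concatenate([x, np.zeros(n - k)])
d = np.concatenate([np.zeros(k), np.ones(n - k)])
X_dummy = np.column_stack([np.ones(n), x_full, d])
beta_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)

# The intercept and slope estimates coincide across the two methods
print(np.allclose(beta_drop, beta_dummy[:2]))   # True
```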
Best Answer
In model $2$, let's call $\bar{y}_m$ the mean of the $y_i$ for which $x_i$ is missing, and similarly $RSS_m$ their residual sum of squares, while $RSS_p$ is the residual sum of squares where $x_i$ is present. Clearly $RSS_2=RSS_p+RSS_m$.
Since the minimum of a sum is at least the sum of the minima, writing $\beta_3$ for the coefficient of the dummy,
$$\min_{\beta_1,\beta_2,\beta_3} RSS_2 \;\ge\; \min_{\beta_1,\beta_2} RSS_p + \min_{\beta_1,\beta_3} RSS_m \;=\; \min_{\beta_1,\beta_2} RSS_p + \sum_{x_i\ \text{missing}} (y_i - \bar{y}_m)^2,$$
because $\beta_1 + \beta_3$ can take any value, so $RSS_m$ has unconstrained minimum $\sum (y_i - \bar{y}_m)^2$. This bound is attained: take the $\beta_1, \beta_2$ that minimize $RSS_p$, which are exactly the estimates from the regression with the missing observations dropped, and then set $\beta_3 = \bar{y}_m - \beta_1$. Hence the two methods yield the same $\beta_1$ and $\beta_2$, and $RSS_2$ exceeds the dropped-observations $RSS$ by exactly $\sum_{x_i\ \text{missing}} (y_i - \bar{y}_m)^2$.
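The decomposition $RSS_2 = RSS_p + RSS_m$ and the value of $RSS_m$ at the optimum can be verified numerically (my own sketch with made-up data; $RSS_p$, $RSS_m$, $\bar{y}_m$ as defined above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 6                          # first k observations have x present
x = rng.normal(size=k)
y = rng.normal(size=n)

# Dummy-variable fit: x set to 0 where missing, d flags the missing rows
x_full = np.concatenate([x, np.zeros(n - k)])
d = np.concatenate([np.zeros(k), np.ones(n - k)])
X = np.column_stack([np.ones(n), x_full, d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

rss_2 = resid @ resid
rss_p = resid[:k] @ resid[:k]         # observations with x present
rss_m = resid[k:] @ resid[k:]         # observations with x missing

# Regression with the missing observations dropped
X_drop = np.column_stack([np.ones(k), x])
beta_drop, *_ = np.linalg.lstsq(X_drop, y[:k], rcond=None)
rss_drop = ((y[:k] - X_drop @ beta_drop) ** 2).sum()

y_m_bar = y[k:].mean()
print(np.isclose(rss_2, rss_p + rss_m))                    # True
print(np.isclose(rss_m, ((y[k:] - y_m_bar) ** 2).sum()))   # True
print(np.isclose(rss_p, rss_drop))                         # True
```

The three checks confirm, in order: the decomposition of $RSS_2$; that the missing-group residuals are $y_i - \bar{y}_m$ at the optimum; and that $RSS_p$ equals the $RSS$ of the dropped-observations regression.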