Very short question. What exactly is the difference between an instrumental variable and a proxy variable when building a regression model?
Proxy variables versus instrumental variables
instrumental-variables, regression
Related Solutions
[The following may seem a little technical because of the equations, but it builds mainly on the arrow charts to provide the intuition, which requires only a very basic understanding of OLS - so don't be put off.]
Suppose you want to estimate the causal effect of $x_i$ on $y_i$, given by the coefficient $\beta$, but for some reason there is a correlation between your explanatory variable and the error term:
$$\begin{matrix}y_i &=& \alpha &+& \beta x_i &+& \epsilon_i & \\ & && & & \hspace{-1cm}\nwarrow & \hspace{-0.8cm} \nearrow \\ & & & & & corr & \end{matrix}$$
This might happen because we forgot to include an important variable that also correlates with $x_i$. This problem is known as omitted variable bias, and then your $\widehat{\beta}$ will not give you the causal effect. This is a case in which you would want to use an instrument, because only then can you find the true causal effect.
An instrument is a new variable $z_i$ which is uncorrelated with $\epsilon_i$, but that correlates well with $x_i$ and which only influences $y_i$ through $x_i$ - so our instrument is what is called "exogenous". It's like in this chart here:
$$\begin{matrix} z_i & \rightarrow & x_i & \rightarrow & y_i \newline & & \uparrow & \nearrow & \newline & & \epsilon_i & \end{matrix}$$
So how do we use this new variable?
Maybe you remember the ANOVA type idea behind regression where you split the total variation of a dependent variable into an explained and an unexplained component. For example, if you regress your $x_i$ on the instrument,
$$\underbrace{x_i}_{\text{total variation}} = \underbrace{a \quad + \quad \pi z_i}_{\text{explained variation}} \quad + \underbrace{\eta_i}_{\text{unexplained variation}}$$
then you know that the explained variation here is exogenous to our original equation because it depends on the exogenous variable $z_i$ only. So in this sense, we split our $x_i$ up into a part that we can claim is certainly exogenous (that's the part that depends on $z_i$) and some unexplained part $\eta_i$ that keeps all the bad variation which correlates with $\epsilon_i$. Now we take the exogenous part of this regression, call it $\widehat{x_i}$,
$$x_i \quad = \underbrace{a \quad + \quad \pi z_i}_{\text{good variation} \: = \: \widehat{x}_i } \quad + \underbrace{\eta_i}_{\text{bad variation}}$$
and put this into our original regression: $$y_i = \alpha + \beta \widehat{x}_i + \epsilon_i$$
Now since $\widehat{x}_i$ is no longer correlated with $\epsilon_i$ (remember, we "filtered out" this part from $x_i$ and left it in $\eta_i$), we can consistently estimate our $\beta$ because the instrument has helped us break the correlation between the explanatory variable and the error. This is one way to apply instrumental variables. The method is called two-stage least squares (2SLS), where our regression of $x_i$ on $z_i$ is called the "first stage" and the last equation here is called the "second stage".
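To make this concrete, here is a minimal simulation sketch of the two stages in Python (assuming numpy and statsmodels are available; the data-generating model, coefficients, and variable names are all illustrative, not from the original post):

```python
# A minimal 2SLS sketch on simulated data; all numbers are made up for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

z = rng.normal(size=n)                        # instrument: exogenous by construction
e = rng.normal(size=n)                        # structural error epsilon
x = 0.8 * z + 0.5 * e + rng.normal(size=n)    # x is endogenous: corr(x, e) > 0
y = 1.0 + 2.0 * x + e                         # true beta = 2

# Naive OLS of y on x is inconsistent because x correlates with the error
ols = sm.OLS(y, sm.add_constant(x)).fit()

# First stage: regress x on z and keep the fitted ("good") variation x_hat
first = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = first.fittedvalues

# Second stage: regress y on x_hat to recover beta consistently
second = sm.OLS(y, sm.add_constant(x_hat)).fit()

print(f"OLS  beta: {ols.params[1]:.3f}")      # biased upward, well above 2
print(f"2SLS beta: {second.params[1]:.3f}")   # close to the true value 2
```

One caveat: running the two stages by hand like this gives the right point estimate, but the second-stage standard errors are not the correct 2SLS standard errors (they ignore that $\widehat{x}_i$ was itself estimated), which is why dedicated IV routines are preferable in practice.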
In terms of our original picture (I leave out the $\epsilon_i$ to avoid making a mess, but remember that it is there!), instead of taking the direct but flawed route from $x_i$ to $y_i$ we took an intermediate step via $\widehat{x}_i$:
$$\begin{matrix} & & & & & \widehat{x}_i \newline & & & & \nearrow & \downarrow \newline & z_i & \rightarrow & x_i & \rightarrow & y_i \end{matrix}$$
Thanks to this slight diversion on our road to the causal effect, we were able to consistently estimate $\beta$ by using the instrument. The cost of the diversion is that instrumental variables models are generally less precise, meaning that they tend to have larger standard errors.
How do we find instruments?
That's not an easy question, because you need to make a good case as to why your $z_i$ would not be correlated with $\epsilon_i$ - and this cannot be tested formally, because the true error is unobserved. The main challenge is therefore to come up with something that can plausibly be seen as exogenous, such as natural disasters or policy changes; sometimes you can even run a randomized experiment. The other answers had some very good examples of this, so I won't repeat that part.
For the binary case (both treatment and instrument) estimating the local average treatment effect (LATE) is straightforward, and you can estimate it as $$E(Y_{i1} - Y_{i0}|D_{i0}=0, D_{i1}=1) = \frac{E[Y_i|Z_i=1] - E[Y_i|Z_i=0]}{P[D_i=1|Z_i = 1] - P[D_i=1|Z_i = 0]} $$
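To illustrate, here is a short Python sketch of this Wald ratio on simulated data (the complier share and the effect sizes are invented for the example):

```python
# A sketch of the binary Wald/LATE estimator from the formula above.
import numpy as np

def wald_late(y, d, z):
    """Reduced form E[Y|Z=1] - E[Y|Z=0] over first stage P[D=1|Z=1] - P[D=1|Z=0]."""
    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    first_stage = d[z == 1].mean() - d[z == 0].mean()
    return reduced_form / first_stage

rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, size=n)                 # binary instrument, e.g. a random offer
complier = rng.random(n) < 0.4                 # 40% compliers, no defiers
d = np.where(complier, z, rng.integers(0, 2, size=n))  # non-compliers ignore z
tau = np.where(complier, 2.0, 0.5)             # compliers' treatment effect is 2.0
y = tau * d + rng.normal(size=n)

print(wald_late(y, d, z))                      # roughly 2.0: the complier-only effect
```

Note that the estimate lands on the compliers' effect of 2.0, not on the population average effect - exactly the $E(Y_{i1} - Y_{i0}|D_{i0}=0, D_{i1}=1)$ on the left-hand side.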
So how does this compare to the multivalued instrument case? First of all, the conditions for identification of a LATE are very similar to the binary case. One additional requirement is strict monotonicity. Suppose your $Z_i$ has finite support and takes values in $0,\ldots,J$, and you have a binary, endogenous treatment $D_i$. Then the requirement on the first stage is $$P(D_i = 1|Z_i = j) > P(D_i = 1|Z_i = j-1)$$ so the higher the value of the instrument, the higher the probability that you get treated.
Also suppose that individuals with the lowest value of the instrument have $D_{i0}=0$ and, conversely, those with the highest value have $D_{iJ} = 1$. What your instrumental variables estimator will give you in this case is a weighted average of Wald ratios, $$E(Y_{i1} - Y_{i0}|D_{i0}=0,D_{iJ}=1) = \sum^J_{j=1}\mu_j \cdot \text{wald}_{j,j-1} $$ where $$\text{wald}_{j,j-1} = \frac{E[Y_i|Z_i = j] - E[Y_i|Z_i = j-1]}{P[D_i = 1|Z_i = j] - P[D_i = 1|Z_i = j-1]}$$ and $$\mu_j = \frac{P[D_i = 1|Z_i = j] - P[D_i = 1|Z_i = j-1]}{\sum^J_{j=1}\left(P[D_i = 1|Z_i = j] - P[D_i = 1|Z_i = j-1]\right)} $$ are the weights, which sum to one.
So you do lots of pairwise comparisons between the $J$ subgroups of individuals, always comparing group $j$ with group $j-1$, which is why the monotonicity condition stated above is needed. The proof of all this is rather lengthy, so I will skip it, but from the statement you can already see why multivalued instruments are not necessarily well liked: they are hard to interpret, because the average treatment effect you estimate here is an average of the treatment effects in each of the $J$ subgroups of compliers.
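For concreteness, here is a Python sketch that computes the estimator exactly as written above - pairwise Wald ratios between adjacent instrument values, weighted by the normalized first-stage jumps (the function name is mine, not from any package):

```python
# Weighted average of pairwise Wald ratios for a multivalued instrument Z in {0,...,J},
# following the formulas stated above.
import numpy as np

def late_multivalued(y, d, z):
    support = np.sort(np.unique(z))                  # instrument values 0, ..., J
    walds, steps = [], []
    for lo, hi in zip(support[:-1], support[1:]):
        dy = y[z == hi].mean() - y[z == lo].mean()
        dd = d[z == hi].mean() - d[z == lo].mean()   # > 0 under strict monotonicity
        walds.append(dy / dd)                        # wald_{j, j-1}
        steps.append(dd)                             # first-stage jump
    mu = np.array(steps) / np.sum(steps)             # weights mu_j sum to one
    return float(np.dot(mu, walds))
```

The weights make the interpretation issue visible: subgroups in which the instrument moves the treatment probability more get more weight in the final number.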
A critical discussion of the LATE framework in general is given by Deaton (2009) and Heckman and Urzua (2009), with a response by Guido Imbens. Another debate is whether discretizing even highly continuous instruments, rather than estimating a weighted average of Wald ratios, is better in the sense of being less biased, but I haven't seen any paper that settles it. Nonetheless, I hope this helps to clear up what you are getting into when you use multivalued instruments in the LATE framework.
Best Answer
An instrumental variable is used to help estimate a causal effect (or to alleviate measurement error). The instrumental variable must affect the independent variable of interest, and affect the dependent variable only through that independent variable. The second part (affecting the dependent variable only through the independent variable of interest) is called an exclusion restriction.
A proxy variable is a variable you use because you think it is correlated with the variable you are really interested in but for which you have no (or only a poor) measurement.
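To see the contrast in code, here is a toy simulation sketch (the "ability/education/wage" story and all numbers are assumptions made up for this illustration): controlling for a noisy proxy of the omitted variable shrinks the omitted variable bias but does not remove it, whereas a valid instrument, as in the 2SLS example above, would.

```python
# Toy contrast: no control vs. a noisy proxy for an unobserved confounder.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100_000
ability = rng.normal(size=n)                        # unobserved confounder
educ = 0.7 * ability + rng.normal(size=n)           # regressor of interest
wage = 1.0 * educ + 1.0 * ability + rng.normal(size=n)  # true effect of educ = 1
score = ability + 0.8 * rng.normal(size=n)          # noisy proxy for ability

naive = sm.OLS(wage, sm.add_constant(educ)).fit()
proxy = sm.OLS(wage, sm.add_constant(np.column_stack([educ, score]))).fit()

print(f"no control: {naive.params[1]:.2f}")         # clearly above 1 (biased)
print(f"with proxy: {proxy.params[1]:.2f}")         # closer to 1, but still biased
```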