The reference for this should be Newey (1987), "Efficient estimation of limited dependent variable models with endogenous explanatory variables", Journal of Econometrics, Vol. 36(3), pp. 231–250. This is the estimator implemented by the ivprobit
command in Stata, for instance, where you can have an OLS first stage and a probit second stage.
Start from the structural model,
$$y_i = \alpha + \beta X_i + \epsilon_i$$
where the explanatory variable of interest $X_i$ is correlated with the error term, $Cov(X_i,\epsilon_i)\neq 0$. In this case, you know that OLS will not recover the true parameter: $\widehat{\beta}$ is biased and inconsistent, so $\widehat{\beta}\not\rightarrow \beta$. Now assume that you have an instrument $Z_i$ such that $Cov(X_i,Z_i)\neq0$ and $Cov(Z_i,\epsilon_i)=0$. These are the assumptions of instrument relevance and the exclusion restriction.
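To see the problem concretely, here is a minimal simulation (all numbers hypothetical: true $\beta = 2$, a confounder $u$ that enters both $X_i$ and $\epsilon_i$) showing that the OLS slope does not recover $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: instrument z, confounder u driving both x and eps
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)
eps = u + rng.normal(size=n)          # Cov(x, eps) != 0 through u
y = 1.0 + 2.0 * x + eps               # true beta = 2

# OLS slope: Cov(y, x) / Var(x) -- biased upward here
ols_slope = np.cov(y, x)[0, 1] / np.var(x)
print(ols_slope)  # above the true beta = 2.0
```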
Take the covariance of each side of the above equation with $Z_i$, and you get
$$
\begin{align}
Cov(y_i,Z_i) &= \beta Cov(X_i,Z_i) + Cov(\epsilon_i, Z_i)\\[0.5em]
\beta &= \frac{Cov(y_i,Z_i)}{Cov(X_i,Z_i)}
\end{align}
$$
which uses the fact that the covariance between a random variable and a constant is zero, as well as the previous exclusion restriction. In fact, we just derived the expression of the IV estimator. The population coefficient $\beta$ can be recovered by dividing the "reduced form" (regression of $y_i$ on $Z_i$) coefficient by the first stage (regression of $X_i$ on $Z_i$) coefficient.
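A quick numeric sketch of this covariance ratio, under an assumed data-generating process with true $\beta = 2$ (the instrument strength and noise levels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: z is a valid instrument, u makes x endogenous
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u + rng.normal(size=n)  # true beta = 2

# IV estimator: Cov(y, Z) / Cov(X, Z)
beta_iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]
print(beta_iv)  # close to 2.0
```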
How does this relate to my answer in the other post? The denominator of the above fraction can be obtained by regressing,
$$
X_i = \delta + \pi Z_i + \eta_i
$$
Now you see that we have an expression for $X_i$ as a linear function of the instrument. If you plug this into the very first equation, you get the so-called reduced form equation:
$$
\begin{align}
y_i &= \alpha + \beta X_i + \epsilon_i \\
&= \alpha + \beta (\delta + \pi Z_i + \eta_i) + \epsilon_i \\
&= (\alpha + \beta\delta) + \beta \pi Z_i + (\beta\eta_i + \epsilon_i)
\end{align}
$$
So the ratio of the reduced form coefficient on $Z_i$ over the first stage coefficient is indeed
$$
\begin{align}
\beta &= \frac{Cov(y_i,Z_i)}{Cov(X_i,Z_i)} \\
&= \frac{\beta\pi}{\pi}\\
&= \beta
\end{align}
$$
the causal effect of interest. This is the maths behind it. I hope that this together with the other answer gives you a better intuition on how an instrumental variable can be used to extract "exogenous" variation (under the stated assumptions) from the original $X_i$ to identify the parameter of interest.
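This indirect-least-squares logic can be checked numerically: run the first-stage and reduced-form regressions on simulated data (hypothetical values, true $\beta = 2$ and $\pi = 0.8$) and take the ratio of the two slopes:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical DGP: first-stage pi = 0.8, structural beta = 2
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 + 0.8 * z + u + rng.normal(size=n)   # first stage
y = 1.0 + 2.0 * x + u + rng.normal(size=n)   # structural equation

# OLS slopes via the usual covariance formulas
pi_hat = np.cov(x, z)[0, 1] / np.var(z)      # first-stage coefficient
rf_hat = np.cov(y, z)[0, 1] / np.var(z)      # reduced-form coefficient
beta_ratio = rf_hat / pi_hat
print(beta_ratio)  # approximately beta = 2.0
```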
Technically, you are actually regressing $[X\;,\; W]$ on $[Z\;,\; W]$ so the resulting fitted values for the second stage regressors are $[\hat X\;,\; \hat W]=[\hat X \;,\;W]$.
$\hat W =W$ since the best prediction of $W$ available in the matrix $[Z\;,\; W]$ is obviously $W$ itself.
But trivialities aside, $W$ is included in the first stage regressors because it is exogenous, and excluding it would lead to a loss of efficiency or consistency (most likely both) in the 2SLS estimator. In other words, the purpose of the first stage is to sort of "divide out" the endogenous part of the $X$'s, in that $\hat X$ is the part of $X$ that can be associated solely with exogenous movements (i.e. changes in $Z$ and $W$). If $X$ and $W$ are correlated at all, not including $W$ here would result in a large loss of information, since the resulting fitted values would not reflect all the exogenous movement in $X$.
$W$ is included in the second stage to avoid omitted variable bias in the 2SLS coefficient estimates. At this point $\hat X$ is almost surely correlated with $W$, so if $W$ has any effect on $Y$, leaving it out of the regression will result in biased coefficient estimates.
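A minimal sketch of this two-step procedure, with hypothetical coefficients (true effect of $X$ is 2, of $W$ is 1.5), where the exogenous control $W$ appears in both stages:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical DGP: W exogenous, X endogenous via u, Z the instrument
w = rng.normal(size=n)
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + 0.5 * w + u + rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * w + u + rng.normal(size=n)

def ols(yvec, Xmat):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(Xmat, yvec, rcond=None)[0]

ones = np.ones(n)
# First stage: regress X on [1, Z, W]; W "instruments itself"
Z1 = np.column_stack([ones, z, w])
x_hat = Z1 @ ols(x, Z1)

# Second stage: regress y on [1, X_hat, W]
X2 = np.column_stack([ones, x_hat, w])
b = ols(y, X2)
print(b[1], b[2])  # approximately 2.0 and 1.5
```

Dropping $w$ from either stage in this sketch moves the estimates away from the true values, which is exactly the efficiency/bias point made above.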