This is an interesting [and very far from "stupid"] question that actually bothered me for a while! We cover it in Monte Carlo Statistical Methods (Section 3.3.2, pages 95-96). The crux of it is that, by dividing by the sum of the weights, the optimality vanishes. It is actually easy to see when $h$ is a positive function: the variance-minimising importance function for the unnormalised estimator is then $g^*(x)\propto h(x)f(x)$, so the importance weight satisfies
$$
w(x) = \frac{f(x)}{g^*(x)} \propto \frac{1}{h(x)}
$$
and hence
$$
w(x)\,h(x) \propto 1
$$
is constant. The self-normalised estimator therefore reduces to
$$
\widehat{\mathbb{E}[h(X)]} = \frac{\sum_{i=1}^n w(x_i)h(x_i)}{\sum_{i=1}^n w(x_i)} = \dfrac{1}{\frac{1}{n}\sum_{i=1}^n \frac{1}{h(x_i)}}
$$
which is the dreaded harmonic mean estimator (see also this great and definitive post by Radford Neal). The estimator is consistent (in the sense of the Law of Large Numbers) but it is likely to have an infinite variance (which takes us very far from the minimum-variance optimality of the original estimator!).
The fundamental reason why optimality does not transfer is that the variance of the ratio differs from the variance of the original importance sampling estimate, and is therefore not minimised by the same importance function. Sadly, since there is no closed-form expression for the variance of the ratio (only delta-method approximations are available), there is no definitive result on the optimal choice of $g$. Of course, one could use different optimal importance functions for the numerator and the denominator, but this does not lead anywhere in practice!
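To see numerically that self-normalisation loses the optimality, here is a minimal sketch (my own illustration, not from the book): take $f=\mathcal{N}(0,1)$ and $h(x)=e^x$, for which $g^*(x)\propto h(x)f(x)$ works out to $\mathcal{N}(1,1)$. The unnormalised estimator then has exactly zero variance (since $w(x)h(x)\equiv e^{1/2}$), while the self-normalised one does not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target f = N(0,1), integrand h(x) = exp(x); true value E_f[h(X)] = exp(1/2).
# The optimal importance function g*(x) ∝ h(x)f(x) works out to N(1,1) here.
true_value = np.exp(0.5)
n, reps = 1000, 500

unnorm, selfnorm = [], []
for _ in range(reps):
    x = rng.normal(1.0, 1.0, size=n)      # draws from g* = N(1,1)
    w = np.exp(0.5 - x)                   # w(x) = f(x)/g*(x), proportional to 1/h(x)
    wh = w * np.exp(x)                    # identically exp(1/2): zero variance
    unnorm.append(wh.mean())
    selfnorm.append(wh.sum() / w.sum())   # self-normalised (harmonic-mean) form

print(np.var(unnorm))    # essentially 0, up to floating point
print(np.var(selfnorm))  # strictly positive
```

In this particular example the harmonic-mean form still has finite variance; the point is only that its variance is no longer zero, so $g^*$ is no longer optimal for the ratio.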
This part primarily relates to your first, third and fourth question:
There's a fundamental difference between Bayesian statistics and frequentist statistics.
Frequentist statistics makes inference about which fixed parameter values are consistent with the data, viewed as random, usually via the likelihood. You take $\theta$ (some parameter or parameters) as fixed but unknown, and ask which values make the observed data more likely; the approach studies the properties of sampling from some model given the parameters in order to infer where the parameters might be. (A Bayesian might say the frequentist approach is based on 'the frequencies of things that didn't happen'.)
Bayesian statistics looks at the information on parameters in terms of a probability distribution on them, which is updated by data, via the likelihood. Parameters have distributions, so you look at $P(\theta|\underline{x})$.
This results in quantities that often look similar, but where the variables in one appear "the wrong way around" when viewed through the lens of the other framework.
So, fundamentally they're somewhat different things, and the fact that things that are on the LHS of one are on the RHS of the other is no accident.
If you do some work with both, it soon becomes reasonably clear.
The second question seems to me to relate simply to a typo.
---
the statement "equivalent to the usual frequentist sampling distribution, that is" : I took this to mean that the authors were stating the frequentist sampling distribution. Have I read this wrongly?
There are two things going on there: they've expressed something a bit loosely (people make this particular kind of over-loose statement all the time), and I think you're also interpreting it differently from the intent.
What exactly does the expression they give mean, then?
Hopefully the discussion below will help clarify the intended sense.
If you can provide a reference (pref. online as I don't have good library access) where this expression is derived I would be grateful.
It follows right from here:
http://en.wikipedia.org/wiki/Bayesian_linear_regression
by taking flat priors on $\beta$ and I think a flat prior for $\sigma^2$ as well.
The reason is that the posterior is thereby proportional to the likelihood and the intervals generated from the posteriors on the parameters match the frequentist confidence intervals for the parameters.
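As a quick numerical check of that claim, here is a minimal sketch (my own hypothetical example, using the Jeffreys-type prior $p(\beta,\sigma^2)\propto 1/\sigma^2$, i.e. flat on $\beta$ and $\log\sigma^2$, which is the version under which the match is exact): the 95% posterior credible interval for a regression slope reproduces the classical $t$-based confidence interval.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)

# simulated data for a simple linear regression y = b0 + b1*x + noise
n = 50
xv = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), xv])
y = 1.0 + 2.0 * xv + rng.normal(0, 3, n)

# OLS fit and classical quantities
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
df = n - 2
s2 = resid @ resid / df
XtX_inv = np.linalg.inv(X.T @ X)

# Sample the posterior under p(beta, sigma^2) ∝ 1/sigma^2:
# sigma^2 | y is scaled inverse chi-square, beta | sigma^2, y is Gaussian
draws = 100_000
sigma2 = df * s2 / rng.chisquare(df, draws)
slope = rng.normal(beta_hat[1], np.sqrt(sigma2 * XtX_inv[1, 1]))
cred = np.quantile(slope, [0.025, 0.975])   # 95% credible interval

# frequentist 95% confidence interval for the slope
se = np.sqrt(s2 * XtX_inv[1, 1])
ci = beta_hat[1] + np.array([-1, 1]) * t.ppf(0.975, df) * se
print(cred, ci)                             # agree up to Monte Carlo error
```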
You might find the first few pages here helpful as well.
This is a most interesting if exotic case of a posterior distribution with atoms!
The difficulty in solving the question is about defining a density for the observation $Y$ against the proper measure. Since $Y$ given $X=x$ takes the values $\pm x$ with probability $\Phi(-x)$ and $x$ takes any real value, it seems impossible to use a counting measure. However, since $Y/x$ takes the values $\pm 1$ with probability $\Phi(-x)$, $Z=Y/x$ has the (conditional) density
$$x\varphi(xz)\mathbb{I}_{(-1,1)}(z)+\Phi(-x)\mathbb{I}_{\{-1,1\}}(z)$$
hence $Y$ has the (conditional) density
$$\varphi(y)\mathbb{I}_{(-x,x)}(y)+\Phi(-x)\mathbb{I}_{\{-x,x\}}(y)$$
Therefore the posterior distribution on $X$ is
$$\varphi(x)\times\left\{\varphi(y)\mathbb{I}_{(-x,x)}(y)+\Phi(-x)\mathbb{I}_{\{-x,x\}}(y)\right\}$$
or
$$\varphi(x)\mathbb{I}_{x>|y|}+\Phi(-|y|)\mathbb{I}_{x=|y|}$$
since $\varphi(|y|)$ cancels out. This is a simple mixed distribution made of a truncated normal and a point mass at $|y|$, for which importance sampling (or another Monte Carlo approach) is not necessary.
From a simulation perspective, if importance sampling is contemplated, this means that the importance sampling distribution must have an atom at $|y|$ with probability $\varrho$ say, plus an absolutely continuous component on $\{x>|y|\}$ with probability $(1-\varrho)$, $h(x)$ say. This leads to an importance weight of the form
$$\omega(x)=\dfrac{\varphi(x)\mathbb{I}_{x>|y|}+\Phi(-|y|)\mathbb{I}_{x=|y|}}{(1-\varrho) h(x)+\varrho\mathbb{I}_{x=|y|}}$$
For instance, if $\varrho=\Phi(-|y|)$, we have
$$\omega(x)=\begin{cases} \dfrac{\varphi(x)}{(1-\varrho) h(x)} &\text{ if }x\ne|y|\\1 &\text{ if }x=|y|\\ \end{cases}$$
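Such a sampler is easy to sketch. The following is my own illustration (a hypothetical shifted-exponential choice for the continuous component $h$, with $\varrho=\Phi(-|y|)$); the self-normalised estimate of the posterior mean can be checked against the closed form $\left[\varphi(|y|)+|y|\,\Phi(-|y|)\right]/2\Phi(-|y|)$, which follows from integrating $x$ against the mixed posterior above.

```python
import numpy as np
from math import erfc, sqrt, pi

rng = np.random.default_rng(1)

y = 0.5                                # an arbitrary observed value
a = abs(y)
rho = 0.5 * erfc(a / sqrt(2))          # Phi(-|y|), matching the posterior atom
lam = 1.0                              # rate of the hypothetical shifted-exponential h

n = 200_000
atom = rng.random(n) < rho             # which proposals hit the atom at |y|
x = np.where(atom, a, a + rng.exponential(1.0 / lam, size=n))
h = lam * np.exp(-lam * (x - a))                 # continuous proposal density on x > |y|
phi_x = np.exp(-x**2 / 2) / np.sqrt(2 * pi)      # standard normal pdf at x
# weight: 1 on the atom (since rho = Phi(-|y|)), phi/((1-rho) h) elsewhere
w = np.where(atom, 1.0, phi_x / ((1 - rho) * h))

post_mean_is = np.sum(w * x) / np.sum(w)         # self-normalised estimate
phi_a = np.exp(-a**2 / 2) / np.sqrt(2 * pi)
post_mean_exact = (phi_a + a * rho) / (2 * rho)  # closed-form posterior mean
print(post_mean_is, post_mean_exact)
```

Of course, as noted above, the direct mixed-distribution representation makes this Monte Carlo exercise unnecessary; the sketch only illustrates how the atom enters the weights.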