I would like to explain this the way I understood it, spelling out every step along the way.
Assumptions:
- $g(x,t)$ is a function of $x$ and $t$.
- $p(x,t)$ is a joint distribution over $x$ and $t$.
Basic formulas:
$$\mathbb{E}_t[g|x] = \int_t{g(x,t)p(t|x)\mathop{dt}} \ (\mathbb{E}_t[g|x] \text{ is a function of $x$ and constant w.r.t. } t) \tag{1}\label{1} $$
$$\mathbb{E}_t[t|x] = \int_t{t \, p(t|x)\mathop{dt}} \tag{2}\label{2}$$
$$\operatorname{var}_t[t|x] = \int_t{(t - \mathbb{E}_t[t|x])^2p(t|x)\mathop{dt}} = \mathbb{E}_t[(t - \mathbb{E}_t[t|x])^2 | x] \tag{3}\label{3}$$
$$
\eqalign{\mathbb{E}_t[f(x)g(x,t)|x]
&= \int_t{f(x)g(x,t)p(t|x)\mathop{dt}} \\
&= f(x)\int_t{g(x,t)p(t|x)\mathop{dt}} \\
&= f(x) \ \mathbb{E}_t[g|x] }
\tag{4}\label{4}$$
$$\mathbb{E}_t[f(x)|x] = f(x) \tag{4a}\label{4a}$$
$$\mathbb{E}_{x,t}[g] = \mathbb{E}_x[\mathbb{E}_t[g|x]] \tag{5}\label{5}$$
We derive the last formula above.
$$
\eqalign{
\mathbb{E}_{x,t}[g]
&= \int_x\int_tg(x,t)p(x,t)\mathop{dx}\mathop{dt}\\
&= \int_x\int_tg(x,t)p(x)p(t|x)\mathop{dx}\mathop{dt}\\
&= \int_x p(x)\int_t g(x,t)p(t|x)\mathop{dt}\mathop{dx}\\
&= \int_x \mathbb{E}_t[g|x]p(x)\mathop{dx} \text{ (using \ref{1}) } \\
&= \mathbb{E}_x [\mathbb{E}_t[g|x]] \\
}
$$
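As a quick sanity check of formula (5), here is a minimal Monte Carlo sketch. The joint model ($x \sim \mathcal{N}(0,1)$, $t|x \sim \mathcal{N}(\sin x, 0.3^2)$) and the choice $g(x,t) = xt$ are illustrative assumptions of mine, not anything from the book:

```python
# Monte Carlo check of (5): E_{x,t}[g] = E_x[ E_t[g|x] ].
# Toy model: x ~ N(0,1), t|x ~ N(sin x, 0.3^2), so E_t[t|x] = sin(x).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)                    # samples from p(x)
t = np.sin(x) + 0.3 * rng.standard_normal(n)  # samples from p(t|x)

g = x * t                             # g(x,t) = x*t, so E_t[g|x] = x*sin(x)
lhs = g.mean()                        # direct estimate of E_{x,t}[g]
rhs = (x * np.sin(x)).mean()          # estimate of E_x[ E_t[g|x] ]
print(lhs, rhs)                       # the two agree up to Monte Carlo error
```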
Derivation of the expected loss:
Write the loss function in the form below, obtained by adding and subtracting $\mathbb{E}_t[t|x]$ inside the square and expanding. Please notice the subscript $t$ in the $\mathbb{E}_t$ notation; it is omitted in the book, but I added it here for clarity.
$$
\eqalign{
L(x,t) &= (y(x)-t)^{2} \\
&= (y(x) - \mathbb{E}_t[t|x])^{2} + 2(y(x) - \mathbb{E}_t[t|x])(\mathbb{E}_t[t|x]-t) + (\mathbb{E}_t[t|x]-t)^{2} \\
&= L_1 + 2L_2 + L_3
}
$$
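The expansion above is a pure algebraic identity. Here is a tiny symbolic check with sympy, my own illustrative sketch, where `m` stands in for $\mathbb{E}_t[t|x]$:

```python
# Symbolic check that (y - t)^2 = L1 + 2*L2 + L3 identically.
import sympy as sp

y, t, m = sp.symbols('y t m')   # m plays the role of E_t[t|x]
L1 = (y - m)**2
L2 = (y - m) * (m - t)
L3 = (m - t)**2
print(sp.simplify((y - t)**2 - (L1 + 2*L2 + L3)))  # prints 0
```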
Hence the joint expectation can be represented as:
$$
\eqalign{
\mathbb{E}_{x,t}[L]
&= \mathbb{E}_{x,t}[L_1] + 2\mathbb{E}_{x,t}[L_2] + \mathbb{E}_{x,t}[L_3]
}
$$
We now derive each of the three expectations:
$$
\eqalign{
\mathbb{E}_{x,t}[L_1]
&= \mathbb{E}_{x,t}[(y(x) - \mathbb{E}_t[t|x])^{2}] \\
&= \mathbb{E}_x[ \ \mathbb{E}_t[(y(x) - \mathbb{E}_t[t|x])^{2} | x] \ ] \ \ \text{ (using \ref{5}) } \\
&= \mathbb{E}_x[(y(x) - \mathbb{E}_t[t|x] )^{2}] \text{ (using \ref{4a}, as the operand is a function of $x$ only)} \\
}
$$
$$
\eqalign{
\mathbb{E}_{x,t}[L_2]
&= \mathbb{E}_{x,t}[(y(x) - \mathbb{E}_t[t|x])(\mathbb{E}_t[t|x]-t)] \\
&= \mathbb{E}_x[ \ \mathbb{E}_t[\{(y(x) - \mathbb{E}_t[t|x])(\mathbb{E}_t[t|x]-t)\} \ | \ x] \ ] \text{ (using \ref{5}) } \\
&= \mathbb{E}_x[ \ (y(x) - \mathbb{E}_t[t|x]) \ \mathbb{E}_t[(\mathbb{E}_t[t|x]-t) | x] \ \ ] \ \text{ (using \ref{4} on the inner expectation)}
}
$$
Considering only the inner expectation:
$$
\eqalign{
\mathbb{E}_t[(\mathbb{E}_t[t|x]-t) | x]
&= \mathbb{E}_t[\mathbb{E}_t[t|x] | x] - \mathbb{E}_t[t|x] \text{ (using linearity of $\mathbb{E}$)} \\
&= \mathbb{E}_t[t|x] - \mathbb{E}_t[t|x] \text{ (using \ref{4a} as $\mathbb{E}_t[t|x]$ is a function of $x$)}\\
&= 0
}
$$
Therefore,
$$
\eqalign{
\mathbb{E}_{x,t}[L_2]
&= \mathbb{E}_x[ \ (y(x) - \mathbb{E}_t[t|x]) \ \cdot \ 0 \ ] \\
&= 0
}
$$
$$
\eqalign{
\mathbb{E}_{x,t}[L_3]
&= \mathbb{E}_{x,t}[(\mathbb{E}_t[t|x]-t)^{2}] \\
&= \mathbb{E}_x[ \ \mathbb{E}_t[(\mathbb{E}_t[t|x]-t)^{2} | x] \ ] \text{ (using \ref{5}) } \\
&= \mathbb{E}_x[\operatorname{var}_t[t|x]] \text{ (using \ref{3}) }
}
$$
Putting them all together and expressing the $\mathbb{E}_x$ terms as integrals over $x$, we get the following form:
$$
\mathbb{E}_{x,t}[L] = \int_x (y(x) - \mathbb{E}_t[t|x])^2 p(x)\mathop{dx} + \int_x \operatorname{var}_t[t|x] p(x) \mathop{dx}
$$
Note: As mentioned by @Juho Kokkalla, the erroneous last term in the book is corrected in the errata.
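For a numerical confirmation of the decomposition, here is a hedged Monte Carlo sketch reusing the toy model from the earlier snippet; the deliberately biased predictor $y(x) = 0.9 \sin x$ is an arbitrary choice of mine:

```python
# Check: E_{x,t}[(y(x)-t)^2] = E_x[(y(x)-E_t[t|x])^2] + E_x[var_t[t|x]].
# Toy model: t|x ~ N(sin x, 0.3^2), so E_t[t|x] = sin(x), var_t[t|x] = 0.09.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.standard_normal(n)
t = np.sin(x) + 0.3 * rng.standard_normal(n)
y = 0.9 * np.sin(x)                       # a deliberately biased predictor

total   = ((y - t)**2).mean()             # E_{x,t}[L]
bias_sq = ((y - np.sin(x))**2).mean()     # E_x[(y(x) - E_t[t|x])^2]
noise   = 0.3**2                          # E_x[var_t[t|x]] (constant here)
print(total, bias_sq + noise)             # agree up to Monte Carlo error
```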
You got off on the wrong track as detailed here. Just because you have a binary $Y$ does not mean that you should be interested in classification. You are really interested in a probability model, so logistic regression is a good choice. Get the nomenclature right or you will confuse everyone.
To the main point, the theory of statistical estimation shows that in the absence of outside information (which would make you use Bayesian logistic regression), maximum likelihood estimation is the gold standard for efficiency and bias. The log likelihood function provides the objective function.
You may have confused a loss/cost/utility function with estimation optimization. Get the optimum estimates using maximum likelihood estimation or penalized maximum likelihood (or, better, Bayesian modeling if you have constraints or other information). A utility function comes in when you need to make an optimum decision that minimizes expected loss (maximizes expected utility). But I don't think you are asking about decision analysis, so stick with the gold-standard objective function: the log likelihood.
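To make that concrete, here is a minimal sketch (not your actual setup) of using the log likelihood as the objective: a logistic regression fit by plain gradient ascent on the Bernoulli log likelihood. The simulated data, step size, and iteration count are illustrative assumptions only:

```python
# Maximum likelihood for logistic regression via gradient ascent on the
# Bernoulli log likelihood; the gradient is X^T (y - p).
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([-0.5, 1.0, 2.0])      # illustrative "true" coefficients
y = rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))  # binary outcomes

beta = np.zeros(p)
for _ in range(5000):
    prob = 1 / (1 + np.exp(-X @ beta))      # P(Y=1 | x) under current beta
    beta += 0.1 / n * X.T @ (y - prob)      # ascend the log likelihood
print(beta)   # approaches the maximum likelihood estimates (near beta_true)
```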
Best Answer
The expression is correct but only for logistic regression where the outcome is $+1$ or $-1$ [i.e. $y(i) = 1$ or $-1$].
If $y(i) = 1$ or $-1$, $y(i)^2$ is always one.
You can expand and simplify the $h(\theta)$ expressions to show: \begin{align}h(\theta)[-y(i)x(i)]\left\{1-h(\theta)[-y(i)x(i)]\right\} &= \frac1{1+\exp[-y(i)x(i)]} \cdot \frac1{1+\exp[y(i)x(i)]} \\&= \frac1{1+\exp[-x(i)]} \cdot \frac1{1+\exp[x(i)]} \end{align}
if $y(i)$ is $1$ or $-1$.
$$\frac1{1+\exp[-x(i)]} \cdot \frac1{1+\exp[x(i)]}$$ is equal to the product of the last $h(\theta)$ expressions in the original photo, and given that $y(i)^2$ is always one, this proves that your second expression is equal to the first in the special case when $y(i)$ is $1$ or $-1$.
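If you want a numerical confirmation, here is a small sketch (the helper name `sigma` is mine) checking that $\sigma(z)\,\sigma(-z)$ is unchanged when $z$ flips sign, which is exactly the step used above:

```python
# For y in {+1, -1}: sigma(-y*z) * (1 - sigma(-y*z)) = sigma(-z) * sigma(z),
# because 1 - sigma(w) = sigma(-w) and the product is symmetric in the sign.
import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
for y in (+1, -1):
    lhs = sigma(-y * z) * (1 - sigma(-y * z))
    rhs = sigma(-z) * sigma(z)
    print(y, np.allclose(lhs, rhs))   # True for both signs of y
```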
Hope this helps.