I do not have access to all the references, but I would like to offer a few remarks on some of your points:
Borovkov, Mathematical Statistics (1998), p. 140 presents another assumption, Condition (R), which is quite strong. This condition requires that $E[(\partial\log f(x;\theta)/\partial\theta)^2]<\infty$; in effect, the author assumes that each entry of the Fisher information matrix (FIM) is well defined.
The assumptions of double differentiability and of interchangeability of the integration and differentiation operators are used to deduce the equality $E[(\partial\log f(x;\theta)/\partial\theta)^2]=-E[\partial^2\log f(x;\theta)/\partial\theta^2]$. This equality is often helpful, but not strictly necessary.
It is difficult to establish general conditions for the existence of the FIM without discarding some models for which the FIM actually exists. For instance, differentiability is not necessary for the existence of the FIM. An example is the double exponential (Laplace) model: the corresponding FIM is well defined, but the density is not twice differentiable at the mode. Conversely, some models that are twice differentiable have an ill-behaved FIM and require additional conditions (see this paper).
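For concreteness, here is a quick sketch of the Laplace case, for the location family $f(x;\theta)=\tfrac{1}{2}e^{-|x-\theta|}$. For $x \neq \theta$,
$$\frac{\partial}{\partial\theta}\log f(x;\theta)=\operatorname{sign}(x-\theta),$$
so
$$I(\theta)=E\left[\operatorname{sign}(X-\theta)^2\right]=1<\infty,$$
even though the second derivative fails to exist at $x=\theta$.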
It is possible to come up with very general sufficient conditions, but they may be too strict, and necessary conditions for the existence of the FIM have not been fully studied. Hence, the answer to your first question may not be simple.
The Fisher Information is defined as
$$\left(\mathcal{I}(\theta)\right)_{i,j} = \operatorname{E}\left[\left.\left(\frac{\partial}{\partial\theta_i} \log f(X;\theta)\right)\left(\frac{\partial}{\partial\theta_j} \log f(X;\theta)\right)\right|\theta\right]$$
(the question in the post you linked to mistakenly states otherwise, and the answer politely corrects it).
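As a quick sanity check of this definition, take a single observation $X\sim \mathcal{N}(\theta,\sigma^2)$ with $\sigma^2$ known: then $\frac{\partial}{\partial\theta}\log f(X;\theta)=(X-\theta)/\sigma^2$, so
$$\mathcal{I}(\theta)=\operatorname{E}\left[\left(\frac{X-\theta}{\sigma^2}\right)^2\,\middle|\,\theta\right]=\frac{1}{\sigma^2}.$$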
Under the following regularity conditions:
1) The support of the random variable involved does not depend on the unknown parameter vector
2) The derivatives of the loglikelihood w.r.t. the parameters exist up to 3rd order
3) The expected value of the squared 1st derivative is finite
and under the assumption that the specification is correct (i.e., the specified distribution family includes the actual distribution that the random variable follows),
the Fisher Information equals the negative of the expected Hessian of the loglikelihood for one observation. This equality is called the "Information Matrix Equality", for obvious reasons.
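To see the Information Matrix Equality in action, here is a minimal Monte Carlo sketch in Python, assuming a correctly specified $N(\mu,1)$ location model (the variable names and sample size below are illustrative choices, not part of anyone's answer). Both sample averages should approximate the true Fisher information, which is $1$ here.

```python
# A minimal Monte Carlo sketch of the Information Matrix Equality for a
# correctly specified N(mu, 1) model. Names and sample size are illustrative.
import numpy as np

rng = np.random.default_rng(0)
mu_true = 2.0
x = rng.normal(mu_true, 1.0, size=100_000)   # data drawn from the assumed family

# For f(x; mu) = N(mu, 1): d/dmu log f = x - mu, and d^2/dmu^2 log f = -1.
score = x - mu_true
hessian = -np.ones_like(x)

lhs = np.mean(score**2)   # E[(d/dmu log f)^2], a Monte Carlo estimate of I(mu)
rhs = -np.mean(hessian)   # -E[d^2/dmu^2 log f], here exactly 1
print(lhs, rhs)           # both approximate the true Fisher information, 1.0
```

Under misspecification, the two sides generally diverge, which is exactly the point made next.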
While the three regularity conditions are relatively "mild" (or at least can be checked), the assumption of correct specification is at the heart of the issues of statistical inference, especially with observational data. It is simply too strong a condition to be accepted easily. This is why it is a major task to prove that the log-likelihood is concave in the parameters (which in many cases leads to consistency and asymptotic normality irrespective of whether the specification is correct, the quasi-MLE case), rather than to assume concavity implicitly by assuming that the Information Matrix Equality holds.
So you were absolutely right in thinking "too good to be true".
As a side note, you neglected the minus sign: with it, the Hessian of the log-likelihood (for one observation) is negative semidefinite, as it should be, since we seek to maximize the log-likelihood, not minimize it.
Best Answer
Check this out: http://en.wikipedia.org/wiki/Fisher_information#Matrix_form
From the definition, we have
$$ I_{ij} = \mathrm{E}_\theta \left[ \left(\partial_i \log f_{X\mid\Theta}(X\mid\theta)\right) \left(\partial_j \log f_{X\mid\Theta}(X\mid\theta)\right)\right] \, , $$ for $i,j=1,\dots,k$, in which $\partial_i=\partial /\partial \theta_i$. Your expression for $I_{ij}$ follows from this one under regularity conditions.
For a nonzero vector $u = (u_1,\dots,u_k)^\top\in\mathbb{R}^k$, it follows from the linearity of the expectation that $$ \sum_{i,j=1}^k u_i I_{ij} u_j = \sum_{i,j=1}^k \left( u_i \mathrm{E}_\theta \left[ \left(\partial_i \log f_{X\mid\Theta}(X\mid\theta)\right) \left(\partial_j \log f_{X\mid\Theta}(X\mid\theta)\right)\right] u_j \right) \\ = \mathrm{E}_\theta \left[ \left(\sum_{i=1}^k u_i \partial_i \log f_{X\mid\Theta}(X\mid\theta)\right) \left(\sum_{j=1}^k u_j \partial_j \log f_{X\mid\Theta} (X\mid\theta)\right)\right] \\ = \mathrm{E}_\theta \left[ \left(\sum_{i=1}^k u_i \partial_i \log f_{X\mid\Theta}(X\mid\theta)\right)^2 \right] \geq 0 \, . $$
If this componentwise notation is too ugly, note that the Fisher information matrix $H=(I_{ij})$ can be written as $H = \mathrm{E}_\theta\left[S S^\top\right]$, in which the score vector $S$ is defined as $$ S = \left( \partial_1 \log f_{X\mid\Theta}(X\mid\theta), \dots, \partial_k \log f_{X\mid\Theta}(X\mid\theta) \right)^\top \, . $$
Hence, we have the one-liner $$ u^\top H u = u^\top \mathrm{E}_\theta[S S^\top] u = \mathrm{E}_\theta[u^\top S S^\top u] = \mathrm{E}_\theta\left[(S^\top u)^2\right] \geq 0 \, . $$
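If a numerical illustration helps, here is a hedged Python sketch (the parameter values and sample size are arbitrary choices of mine) that estimates $H=\mathrm{E}_\theta[SS^\top]$ by Monte Carlo for the two-parameter normal model $N(\mu,\sigma^2)$ and confirms the estimate is positive semidefinite; the true FIM in this parametrization is $\operatorname{diag}(1/\sigma^2,\,2/\sigma^2)$.

```python
# Hedged sketch: estimate H = E[S S^T] by Monte Carlo for the two-parameter
# normal model N(mu, sigma^2) and check positive semidefiniteness.
# Parameter values and sample size are illustrative.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Score vector S = (d/dmu log f, d/dsigma log f) for the normal density:
s_mu = (x - mu) / sigma**2
s_sigma = ((x - mu)**2 - sigma**2) / sigma**3
S = np.stack([s_mu, s_sigma], axis=1)   # shape (n, 2): one score per observation

H = S.T @ S / len(x)                    # Monte Carlo estimate of E[S S^T]
print(H)                                # close to diag(1/sigma^2, 2/sigma^2)
print(np.linalg.eigvalsh(H))            # eigenvalues >= 0, i.e. H is PSD
```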