As for the last question: otherwise, you don't have a critical point and there is nothing to test. :-) Think with one variable: would you look for a maximum or minimum if $f'(x_0) \neq 0$?
Your intuitive understanding of the Hessian points in the right direction. The point is: how to "sum up" all the data $f_{xx}, f_{xy} = f_{yx}, f_{yy}$ in just one single fact?
Well, think about the quadratic form that the Hessian defines. Namely,
$$
q(x,y) =
\begin{pmatrix}
x & y
\end{pmatrix}
\begin{pmatrix}
f_{xx} & f_{xy} \\
f_{yx} & f_{yy}
\end{pmatrix}
\begin{pmatrix}
x \\
y
\end{pmatrix}
=
f_{xx}x^2 + 2 f_{xy}xy + f_{yy}y^2 \ .
$$
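As a quick sanity check (a sketch with NumPy; the numerical values of the second partials are assumptions chosen just for illustration), the matrix product and the expanded formula agree:

```python
import numpy as np

# Stand-in values for the second partials at a point (assumptions for
# illustration): f_xx = 2, f_xy = f_yx = 1, f_yy = 3.
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])

def q_matrix(x, y):
    """Evaluate q via the matrix product (x y) H (x y)^T."""
    v = np.array([x, y])
    return v @ H @ v

def q_expanded(x, y):
    """Evaluate q via the expanded formula f_xx x^2 + 2 f_xy x y + f_yy y^2."""
    return 2.0 * x**2 + 2 * 1.0 * x * y + 3.0 * y**2

for x, y in [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0), (0.5, 2.0)]:
    assert np.isclose(q_matrix(x, y), q_expanded(x, y))
```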
If this quadratic form is positive-definite, that is, $q(x,y) > 0$ for all $(x,y) \neq (0,0)$, then $f$ has a local minimum at the critical point (just as in the one-variable case, where $f''(x_0) > 0$ implies $f$ has a local minimum at $x_0$).
It's more or less obvious that whether $q(x,y)$ is positive at all points doesn't depend on the coordinate system you're using, isn't it?
Right, then do the following experiment: you have a nice quadratic form like
$$
q(x,y) = x^2 + y^2
$$
which is not ashamed to show clearly that it is positive-definite, is it?
Then apply to it the following linear change of coordinates:
$$
\begin{pmatrix}
x \\ y
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 \\
1 & 0
\end{pmatrix}
\begin{pmatrix}
\overline{x} \\
\overline{y}
\end{pmatrix}
$$
and you'll get
$$
q(\overline{x}, \overline{y}) = 2\overline{x}^2 + 2 \overline{x}\overline{y} + \overline{y}^2 \ .
$$
Is it now also clear that $q(\overline{x}, \overline{y}) > 0$ for all $ (\overline{x}, \overline{y}) \neq (0,0)$?
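If in doubt, a quick numerical check (a sketch; the change of coordinates is exactly the one above) confirms that the disguised form takes the same values as $x^2 + y^2$, and hence stays positive away from the origin:

```python
import numpy as np

# Change of coordinates: (x, y)^T = A (xbar, ybar)^T with A = [[1, 1], [1, 0]].
A = np.array([[1.0, 1.0],
              [1.0, 0.0]])

def q_original(x, y):
    return x**2 + y**2

def q_disguised(xbar, ybar):
    return 2 * xbar**2 + 2 * xbar * ybar + ybar**2

rng = np.random.default_rng(0)
for _ in range(100):
    xbar, ybar = rng.normal(size=2)
    x, y = A @ np.array([xbar, ybar])
    # Same value in both coordinate systems, hence still positive away from 0.
    assert np.isclose(q_original(x, y), q_disguised(xbar, ybar))
    assert q_disguised(xbar, ybar) > 0
```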
So, we need some device that allows us to show when a symmetric matrix like $H$ will define a positive-definite quadratic form $q(x,y)$, no matter if the fact is disguised because we are using the wrong coordinate system.
One of these devices is the set of eigenvalues of $H$: if all of them are positive, we know that, maybe after a change of coordinate system, our $q(x,y)$ will have an associated matrix like
$$
\begin{pmatrix}
\lambda & 0 \\
0 & \mu
\end{pmatrix}
$$
with $\lambda, \mu > 0$. Hence, in some coordinate system (and hence, in all of them), our $q > 0$.
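Concretely (a sketch with NumPy; `numpy.linalg.eigvalsh` is the standard routine for symmetric matrices), the disguised form above has matrix $\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, and both of its eigenvalues come out positive:

```python
import numpy as np

# Matrix of the "disguised" form 2 xbar^2 + 2 xbar ybar + ybar^2.
H = np.array([[2.0, 1.0],
              [1.0, 1.0]])

# eigvalsh is designed for symmetric (Hermitian) matrices and returns
# the eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)   # (3 - sqrt(5))/2 and (3 + sqrt(5))/2, both strictly positive

# All eigenvalues positive <=> the quadratic form is positive-definite,
# regardless of the coordinate system used to write it down.
assert np.all(eigenvalues > 0)
```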
Suppose function $f : \mathbb{R}^d\rightarrow \mathbb{R}$ is twice differentiable over its domain. We want to prove $\forall x: \nabla^2 f(x)\succeq 0$ if and only if $f(\cdot)$ is convex.
Convexity $\Rightarrow$ Positive semi-definite Hessian
The first order characterisation of convexity is:
$$f(y)\ge f(x) + \nabla f(x)^\top (y-x)$$
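As a quick illustration of this characterisation (a sketch; $f(x) = x^2$ and its derivative $2x$ are assumptions chosen for the example), the graph of a convex function lies above every one of its tangent lines:

```python
import numpy as np

def f(x):
    return x**2      # a convex function

def grad_f(x):
    return 2 * x     # its derivative

rng = np.random.default_rng(2)
for _ in range(1000):
    x, y = rng.normal(size=2)
    # First-order characterisation: f(y) >= f(x) + f'(x) (y - x).
    # For this f the gap is exactly (y - x)^2 >= 0.
    assert f(y) >= f(x) + grad_f(x) * (y - x)
```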
(i) One-dimensional case: $d=1$
For $d=1$ we only need to prove $f''(x)\ge 0$. Pick two arbitrary points $x,y$, and assume without loss of generality that $y>x$. Using convexity we have
$$f(y) \ge f(x) + f'(x)(y-x)$$
If we swap the roles of $x$ and $y$ and rewrite the inequality, we get
$$f(y) \le f(x) + f'(y)(y-x)$$
Combining the two:
$$f(x) + f'(x)(y-x) \le f(x) + f'(y)(y-x)$$
and finally by cancelling the two $f(x)$ terms and dividing by $y-x$ (assumed to be positive) we'll get:
$$f'(x) \le f'(y)$$
This means the derivative $f'$ must be monotonically non-decreasing.
Now we can prove that $f''(x)\ge 0$, using the definition of a derivative:
$$f''(x) = \lim_{h\rightarrow 0} \frac{f'(x+h)-f'(x)}{h}$$
If we had $f''(x)<0$, then by convergence of the limit there would be some $h>0$ such that $f'(x+h)-f'(x)<0$. However, this contradicts the result above that $f'$ must be monotonically non-decreasing.
(ii) General case: $d>1$
Now going back to the general case, let's take an arbitrary point $x$ and direction $v$ in $\mathbb{R}^d$, and define $g: \mathbb{R}\rightarrow\mathbb{R}$ by
$$g(t) := f(x + t v) $$
It's easy to prove that $g(\cdot)$ is convex and twice differentiable. Using the chain rule, we can compute the second derivative as:
$$g''(t) = v^\top \nabla ^2 f(x + t v ) v$$
Using the result from $d=1$, convexity of $g$ implies $g''(t)\ge 0$ for all $t$. In particular for $t=0$:
$$v^\top \nabla^2 f(x) v=g''(0) \ge 0$$
Now, because $v$ was chosen arbitrarily, the quadratic form $v^\top \nabla^2 f(x)\, v$ is nonnegative in every direction, which is exactly the positive semi-definiteness of $\nabla^2 f(x)$.
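This step can also be checked numerically (a sketch; the convex test function $f(x) = \sum_i e^{x_i}$ and its diagonal Hessian are assumptions for the example). A central difference for $g''(0)$ should match $v^\top \nabla^2 f(x)\, v$, and both should be nonnegative:

```python
import numpy as np

def f(x):
    """A convex test function (an assumption for illustration): sum of exponentials."""
    return np.sum(np.exp(x))

def hessian(x):
    """Exact Hessian of f: diagonal with entries exp(x_i)."""
    return np.diag(np.exp(x))

rng = np.random.default_rng(1)
x = rng.normal(size=3)
v = rng.normal(size=3)

# g(t) = f(x + t v); approximate g''(0) by a central difference.
g = lambda t: f(x + t * v)
h = 1e-4
g2_numeric = (g(h) - 2 * g(0.0) + g(-h)) / h**2

g2_exact = v @ hessian(x) @ v
assert np.isclose(g2_numeric, g2_exact, rtol=1e-4)
assert g2_exact >= 0   # convexity of f forces v^T (Hessian) v >= 0 in every direction
```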
PSD Hessian $\Rightarrow$ Convexity
The proof strategy is very similar. First we prove it for the $d=1$ case and then generalise it using the same trick.
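For completeness, here is the $d=1$ step of that direction (a sketch; it uses Taylor's theorem with the Lagrange remainder, which the twice-differentiability assumption supports). For any $x, y$ there is some $\xi$ between $x$ and $y$ with
$$f(y) = f(x) + f'(x)(y-x) + \tfrac{1}{2} f''(\xi)(y-x)^2.$$
Since $f''(\xi) \ge 0$, the last term is nonnegative, so $f(y) \ge f(x) + f'(x)(y-x)$, which is exactly the first-order characterisation of convexity. For $d > 1$, apply this to $g(t) := f(x + t(y - x))$, whose second derivative is nonnegative by the PSD assumption on the Hessian.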
The proof of the second derivative test at a critical point ($Df_a = 0$) runs as follows: for a given sufficiently smooth map $f: \Bbb{R}^n \to \Bbb{R}$ and a point $a \in \Bbb{R}^n$, we write a second-order Taylor expansion at the point $a$: \begin{align} f(a+h) - f(a) &= \dfrac{1}{2}(D^2f_a)(h,h) + o(\lVert h\rVert^2). \end{align} In other words, there is a "remainder term", a function $\rho$ with $\lim_{h \to 0} \rho(h) = 0$, such that \begin{align} f(a+h) - f(a) &= \dfrac{1}{2}(D^2f_a)(h,h) + \rho(h) \lVert h\rVert^2. \end{align} If the Hessian $D^2f_a$ is positive definite, say, then there is a positive constant $\lambda$ such that for all $h \in \Bbb{R}^n$, $D^2f_a(h,h) \geq \lambda \lVert h\rVert^2$ (take $\lambda$ to be any positive number strictly smaller than the smallest eigenvalue of $D^2f_a$, so that the inequality is strict for $h \neq 0$). Hence, \begin{align} f(a+h) - f(a) &\geq \dfrac{\lambda}{2} \lVert h\rVert^2 + \rho(h) \lVert h\rVert^2 \\ &= \left( \dfrac{\lambda}{2} + \rho(h)\right) \lVert h\rVert^2. \end{align} Since $\rho(h) \to 0$ as $h \to 0$ and $\lambda > 0$, the term in brackets will be strictly positive if $h$ is sufficiently small in norm. Hence, for all $h$ sufficiently small in norm, $f(a+h) - f(a) \geq 0$ (with equality if and only if $h = 0$). This is the proof of why a positive-definite Hessian implies you have a strict local minimum at a critical point $a$.
Of course, a similar proof holds for a negative-definite Hessian implying a strict local maximum.
Roughly speaking, the idea of the proof is that the local behaviour of $f(a+h) - f(a)$ is entirely determined by the behaviour of the Hessian, in the term $D^2f_a(h,h)$ (because the error term is "small"). So, to answer your questions,
The proof of the theorem above shows that we need to ensure that the entire term $D^2f_a(h,h)$ is positive (in fact bounded below by a positive multiple of $\lVert h \rVert^2$), so that we can conclude that $f(a+h) - f(a) \geq 0$. But just because an $n \times n$ matrix has all positive entries, it doesn't mean it is positive-definite (Robert's answer gives an explicit counter example).
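The referenced counterexample isn't reproduced here, but a matrix of the same flavour (an assumption for illustration; any symmetric matrix whose off-diagonal entries dominate works) makes the point that all-positive entries do not imply positive-definiteness:

```python
import numpy as np

# Every entry is positive, yet the matrix is not positive-definite:
M = np.array([[1.0, 2.0],
              [2.0, 1.0]])

eigenvalues = np.linalg.eigvalsh(M)
print(eigenvalues)            # one eigenvalue is negative (-1), the other positive (3)

# The direction h = (1, -1) exposes it: h^T M h = 1 - 2 - 2 + 1 = -2 < 0.
h = np.array([1.0, -1.0])
assert h @ M @ h < 0
assert np.all(M > 0)          # even though every entry of M is positive
```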
Hopefully the proof I gave above justifies why definiteness comes into play (it's to ensure you have a good lower/upper bound on the $D^2f_a(h,h)$ term).
A matrix is positive (respectively, negative) definite if and only if all its eigenvalues are strictly positive (respectively, strictly negative). If there are some positive and some negative eigenvalues, then the matrix is indefinite. If this is the case for your Hessian, it means you have a saddle point (because the function is increasing along some directions while decreasing along others).
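For instance (a sketch; $f(x,y) = x^2 - y^2$ is the standard saddle example, with Hessian $\operatorname{diag}(2, -2)$ at the origin), the eigenvalue signs classify the critical point:

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the critical point (0, 0).
H = np.array([[ 2.0,  0.0],
              [ 0.0, -2.0]])

eigenvalues = np.linalg.eigvalsh(H)

if np.all(eigenvalues > 0):
    verdict = "local minimum"
elif np.all(eigenvalues < 0):
    verdict = "local maximum"
elif np.any(eigenvalues > 0) and np.any(eigenvalues < 0):
    verdict = "saddle point"
else:
    verdict = "inconclusive (some eigenvalue is zero)"

print(verdict)   # saddle point: f increases along the x-axis, decreases along y
```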