It turns out that twice-differentiability implies that the Hessian is symmetric even without convexity and with no reference to whether the second-order partial derivatives are continuous! The proof below is based on Theorem 8.12.2 in the book Foundations of Modern Analysis by Dieudonné (1969, p. 180).
Claim: Let $U\subseteq\mathbb R^n$ be an open set and $f:U\to\mathbb R$ a function. Suppose that $f$ is (Fréchet) differentiable on $U$ and that it is twice (Fréchet) differentiable at $\mathbf x_0\in U$. Then, the Hessian matrix $\mathbf H(\mathbf x_0)$ at $\mathbf x_0$ is symmetric.
Proof: Let $\mathbf D:U\to\mathbb R^n$ denote the gradient function of $f$. Fix $\varepsilon>0$. Since $\mathbf D$ is Fréchet differentiable at $\mathbf x_0$ by assumption, it follows that there exists some $\delta>0$ such that $\|\mathbf v\|<4\delta$ implies that
$$\left\|\mathbf D(\mathbf x_0+\mathbf v)-\mathbf D(\mathbf x_0)-\mathbf H(\mathbf x_0)\cdot\mathbf v\right\|\leq\varepsilon\|\mathbf v\|.$$
There is no loss of generality in taking $\delta$ to be so small that the open ball $B(4\delta,\mathbf x_0)$ is contained in the open set $U$.
For any $i,j\in\{1,\ldots,n\}$, let $\mathbf e_i$ and $\mathbf e_j$ be the corresponding standard basis vectors of unit length. Let $\mathbf s\equiv\delta\mathbf e_i$ and $\mathbf t\equiv\delta\mathbf e_j$. It is clear that $\mathbf x_0+\xi\mathbf s+\mathbf t$ and $\mathbf x_0+\xi\mathbf s$ are both in $U$ whenever $\xi\in[0,1]$; this is because $\|\xi\mathbf s+\mathbf t\|<4\delta$ and $\|\xi\mathbf s\|<4\delta$. Define the following function $g:[0,1]\to\mathbb R$:
$$g(\xi)\equiv f(\mathbf x_0+\xi\mathbf s+\mathbf t)-f(\mathbf x_0+\xi\mathbf s)\quad\forall\xi\in[0,1].$$
Clearly, $g$ is continuous on $[0,1]$ and differentiable on $(0,1)$. Lagrange's mean-value theorem, in turn, implies that there exists some $\xi\in(0,1)$ such that
$$g(1)-g(0)=g'(\xi)=\mathbf s\cdot\left[\mathbf D(\mathbf x_0+\xi\mathbf s+\mathbf t)-\mathbf D(\mathbf x_0+\xi\mathbf s)\right],$$
where the second equality follows from the chain rule.
Next, one can derive the following chain of inequalities (the first one uses the Cauchy–Schwarz inequality):
\begin{align*}
&\left|g(1)-g(0)-\mathbf s\cdot\mathbf H(\mathbf x_0)\cdot\mathbf t\right|\\
&\qquad\leq\underbrace{\|\mathbf s\|}_{=\delta}\left\|[\mathbf D(\mathbf x_0+\xi\mathbf s+\mathbf t)-\mathbf D(\mathbf x_0)]-[\mathbf D(\mathbf x_0+\xi\mathbf s)-\mathbf D(\mathbf x_0)]-\mathbf H(\mathbf x_0)\cdot\mathbf t\right\|\\
&\qquad=\delta\left\|[\mathbf D(\mathbf x_0+\xi\mathbf s+\mathbf t)-\mathbf D(\mathbf x_0)-\mathbf H(\mathbf x_0)\cdot(\xi\mathbf s+\mathbf t)]-[\mathbf D(\mathbf x_0+\xi\mathbf s)-\mathbf D(\mathbf x_0)-\mathbf H(\mathbf x_0)\cdot(\xi\mathbf s)]\right\|\\
&\qquad\leq\delta\varepsilon\left(\|\xi\mathbf s+\mathbf t\|+\|\xi\mathbf s\|\right)<8\delta^2\varepsilon.
\end{align*}
That is, one has that
$$|f(\mathbf x_0+\mathbf s+\mathbf t)-f(\mathbf x_0+\mathbf s)-f(\mathbf x_0+\mathbf t)+f(\mathbf x_0)-\delta^2\mathbf e_i\cdot\mathbf H(\mathbf x_0)\cdot\mathbf e_j|<8\delta^2\varepsilon,$$
and, by a completely analogous and symmetric reasoning in which $\mathbf s$ and $\mathbf t$ are interchanged,
$$|f(\mathbf x_0+\mathbf s+\mathbf t)-f(\mathbf x_0+\mathbf s)-f(\mathbf x_0+\mathbf t)+f(\mathbf x_0)-\delta^2\mathbf e_j\cdot\mathbf H(\mathbf x_0)\cdot\mathbf e_i|<8\delta^2\varepsilon.$$
Given that $\mathbf e_i\cdot\mathbf H(\mathbf x_0)\cdot\mathbf e_j=h_{ij}(\mathbf x_0)\equiv\partial^2 f/(\partial x_i\partial x_j)(\mathbf x_0)$, the preceding two inequalities imply that
$$\left|h_{ij}(\mathbf x_0)-h_{ji}(\mathbf x_0)\right|<16\varepsilon.$$
Taking $\varepsilon$ to be arbitrarily small, one sees that $h_{ij}(\mathbf x_0)=h_{ji}(\mathbf x_0)$. $\blacksquare$
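To see the key quantity of the proof in action, here is a small numerical sketch (not part of the argument; it assumes NumPy and an arbitrarily chosen smooth $f$ on $\mathbb R^2$): the second difference $f(\mathbf x_0+\mathbf s+\mathbf t)-f(\mathbf x_0+\mathbf s)-f(\mathbf x_0+\mathbf t)+f(\mathbf x_0)$ is symmetric in $\mathbf s$ and $\mathbf t$ by inspection, and dividing it by $\delta^2$ approximates $h_{ij}(\mathbf x_0)$.

```python
import numpy as np

# Example function f : R^2 -> R (any twice-differentiable f would do).
def f(x):
    return np.exp(x[0]) * np.sin(x[1]) + x[0] ** 3 * x[1]

x0 = np.array([0.3, -0.7])
delta = 1e-4
i, j = 0, 1
s = delta * np.eye(2)[i]  # s = delta * e_i
t = delta * np.eye(2)[j]  # t = delta * e_j

# The second difference from the proof; note it is symmetric in s and t.
second_diff = f(x0 + s + t) - f(x0 + s) - f(x0 + t) + f(x0)
approx_h_ij = second_diff / delta ** 2

# Analytic mixed partial of the f above at x0, for comparison.
exact_h_ij = np.exp(x0[0]) * np.cos(x0[1]) + 3 * x0[0] ** 2
print(approx_h_ij, exact_h_ij)  # the two numbers agree to several digits
```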
$\newcommand\R{\mathbb R}\newcommand\Hom{\mathrm{Hom}}$
I think you are getting confused by the use of matrices and coordinates here,
which are completely unnecessary,
so let us do away with them.
There is also no reason to restrict our discussion to $\R^n$,
since virtually nothing changes if we just use a general vector space,
and in fact we will need this generality below.
So let $V, W$ be finite-dimensional (real) normed vector spaces
with norms $||\cdot||_V$ and $||\cdot||_W$.
The choice of norms is actually irrelevant
since all norms on a given finite dimensional vector space
generate the same topology, meaning in particular that the notion of
convergence is independent of the norm chosen.
Now let $f : V \to W$ and choose $x \in V$.
The differential of $f$ at $x$
(this is what is called the Fréchet derivative,
though I personally prefer to avoid the term "derivative" here)
is the unique linear transformation $Df_x : V \to W$
which "best approximates $f$" near $x$; specifically,
$$
\lim_{||h||_V \to 0}\frac{||f(x+h) - f(x) - Df_x(h)||_W}{||h||_V} = 0,
$$
or in other words $[f(x+h)-f(x)]/||h||_V$ converges to $Df_x(h)/||h||_V$
under the norm $||\cdot||_W$ as $||h||_V \to 0$.
For fixed $h$, taking the increment to be $\epsilon h$ with $\epsilon \in \R$, it follows that
$$
Df_x(h) = \lim_{\epsilon\to0}\frac{f(x+\epsilon h)-f(x)}\epsilon.
$$
Another way to state this is that if $||h||_V$ is "small enough" then
$$
f(x + h) \approx f(x) + Df_x(h).
$$
In this sense $Df_x(h)$ is the "change of $f$ in the direction $h$".
If we were to choose bases for $V, W$,
then we could represent our vectors as column vectors and $Df_x$ as a matrix.
If $V = \R^m$ and $W = \R^n$ and we choose the standard bases,
the matrix of $Df_x$ is the Jacobian matrix.
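As a concrete illustration (a sketch only, assuming NumPy; the $f$ below is chosen purely for demonstration), the matrix of $Df_x$ can be approximated column by column from the limit above, since column $j$ is the change of $f$ in the direction $e_j$:

```python
import numpy as np

# Illustrative f : R^2 -> R^2.
def f(x):
    return np.array([x[0] ** 2 * x[1], np.sin(x[0] * x[1])])

def jacobian(f, x, eps=1e-6):
    """Approximate the matrix of Df_x in the standard bases: column j is
    the directional difference quotient of f at x along e_j."""
    n = len(x)
    cols = [(f(x + eps * e) - f(x)) / eps for e in np.eye(n)]
    return np.column_stack(cols)

x = np.array([0.5, 1.2])
print(jacobian(f, x))
# Analytic Jacobian of the f above, for comparison:
# [[2*x*y,        x**2      ],
#  [y*cos(x*y),   x*cos(x*y)]]
```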
The space $\Hom(V, W)$ of linear transformations
is itself a finite-dimensional vector space,
so we can also differentiate functions taking values in it.
In particular, $x \mapsto Df_x$ is a function $V \to \Hom(V, W)$;
thus the second differential $D^2f_x$ of $f$ at $x$
is a bilinear function $V\times V \to W$ defined by
$$
D^2f_x(h, k) = D[y \mapsto Df_y]_x(k)(h).
$$
Note that $D[y \mapsto Df_y]_x : V \to \Hom(V, W)$,
so first we evaluate on $k$ to get $D[y \mapsto Df_y]_x(k) : V \to W$,
and then on $h$ to finally get $D^2f_x(h, k) \in W$.
This is the best second-order, bilinear approximation to $f$,
in the sense that if $h, k \in V$ with $||h||_V, ||k||_V$ "small"
then simply using the definition of the differential gives
$$
f(x + k + h)
\approx f(x + k) + Df_{x+k}(h)
\approx f(x) + Df_x(k) + Df_x(h) + D^2f_x(h,k).
$$
We can also unravel the definition of $D^2f_x$ to get a limit expression
for $D^2f_x(h,k)$ with fixed $h, k$:
$$\begin{aligned}
D^2f_x(h,k)
&= \lim_{\eta\to0}\frac{Df_{x+\eta k}(h) - Df_x(h)}\eta
\\
&= \lim_{\eta\to0}\lim_{\epsilon\to0}
\frac{f(x+\eta k+\epsilon h) - f(x+\eta k) - f(x+\epsilon h) + f(x)}
{\epsilon\eta}.
\end{aligned}$$
In the particular case that $k = h$, we can take $\eta = \epsilon$ (collapsing the iterated limit into a single one, which is justified when $f$ is twice differentiable at $x$) to get
$$
D^2f_x(h,h)
= \lim_{\epsilon\to0}
\frac{f(x+2\epsilon h) - 2f(x+\epsilon h) + f(x)}
{\epsilon^2}.
$$
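As a sanity check on this last formula, here is a quick numerical sketch (assuming NumPy; the particular $f$ below is just an example whose second derivative along $h$ is easy to write in closed form):

```python
import numpy as np

# Illustrative f : R^2 -> R^2.
def f(x):
    return np.array([x[0] ** 2 * x[1], x[0] * x[1] ** 2])

x = np.array([0.4, -1.1])
h = np.array([1.0, 2.0])
eps = 1e-4

# Second-difference quotient for D^2 f_x(h, h).
approx = (f(x + 2 * eps * h) - 2 * f(x + eps * h) + f(x)) / eps ** 2

# Analytic value: d^2/dt^2 f(x + t h) at t = 0, computed component-wise.
exact = np.array([2 * h[0] ** 2 * x[1] + 4 * h[0] * h[1] * x[0],
                  2 * h[1] ** 2 * x[0] + 4 * h[0] * h[1] * x[1]])
print(approx, exact)  # agree to a few digits
```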
It is uninteresting to differentiate $y \mapsto Df_x(y)$ (with $x$ held fixed)
because the differential of a linear function is the function itself;
that is, if $f$ is linear then
$$
Df_x(h) = f(h),
$$
so in particular
$$
D[y \mapsto Df_x(y)]_z(k) = Df_x(k).
$$
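A tiny numerical illustration of this point (assuming NumPy; the matrix $M$ below is arbitrary): for a linear $f$ the difference quotient does not depend on $x$ or on the step size at all.

```python
import numpy as np

# A linear map f : R^3 -> R^2, given by a fixed matrix M.
M = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, 1.0]])

def f(x):
    return M @ x

x = np.array([0.3, -1.0, 2.0])
h = np.array([1.0, 1.0, -0.5])

for eps in (1.0, 0.25, 1e-3):
    # For linear f the quotient equals f(h) for every eps (up to round-off),
    # so the limit, i.e. Df_x(h), is just f(h).
    print((f(x + eps * h) - f(x)) / eps, f(h))
```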
Another name for a multilinear function is a tensor.
When $V = W$,
$D^2f_x$ is a type $(1,2)$-tensor (it takes in two vectors and spits out one);
when $W = \R$, it is a type $(0,2)$-tensor;
and generally when $W = V^{\otimes k}$ it is a type $(k,2)$-tensor.
Linear functions $V \to V$ are type $(1,1)$-tensors,
so typically this is how matrices are thought of as well.
However, given a $(0,2)$-tensor $T$,
we can form a matrix $\mathbf T$ whose entries are $T(e_i, e_j)$
where $\{e_i\}_{i=1}^m$ is a basis for $V$;
given vectors $v_1, v_2 \in V$ expressed as column vectors in this basis,
$$
T(v_1, v_2) = v_1^{\top}\mathbf Tv_2.
$$
Now take $V = \R^m$ and $W = \R$ so that $f : \R^m \to \R$.
Then $D^2f_x : \R^m\times\R^m \to \R$;
the matrix corresponding to this $(0,2)$-tensor
expressed in the standard basis of $\R^m$ is the Hessian.
The moral of the above story is this: the differential of some $f$ gives us a function $x \mapsto Df_x$, so we can also differentiate this to get $D^2f_x$. Both applications of $D$ also need a direction to differentiate along, so $D^2f_x$ has two arguments giving us $D^2f_x(h, k)$. If $f$ is a function taking vectors to scalars, then it turns out that there is a matrix $\mathbf H_x$ (dependent on $x$) such that
$$
D^2f_x(h, k) = h^\top\mathbf H_x k,
$$
and we call $\mathbf H_x$ the Hessian of $f$ at $x$.
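Here is a short numerical sketch tying the two pictures together (assuming NumPy; the scalar $f$ below and its Hessian are just an illustrative example): the double-difference quotient for $D^2f_x(h,k)$ matches $h^\top\mathbf H_x k$.

```python
import numpy as np

# Scalar example f : R^3 -> R.
def f(x):
    return x[0] * x[1] ** 2 + np.exp(x[2]) * x[0]

def hessian(x):
    """Analytic Hessian of the f above at x."""
    return np.array([[0.0,          2 * x[1],  np.exp(x[2])        ],
                     [2 * x[1],     2 * x[0],  0.0                 ],
                     [np.exp(x[2]), 0.0,       x[0] * np.exp(x[2]) ]])

x = np.array([0.7, -0.2, 0.1])
h = np.array([1.0, 0.5, -1.0])
k = np.array([0.3, 2.0, 1.5])
eps = 1e-4

# Double-difference approximation of D^2 f_x(h, k).
approx = (f(x + eps * k + eps * h) - f(x + eps * k)
          - f(x + eps * h) + f(x)) / eps ** 2

print(approx, h @ hessian(x) @ k)  # the two values agree to a few digits
```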
Best Answer
If $A=\left[\begin{smallmatrix}a&b\\c&d\end{smallmatrix}\right]$, then expanding
$f(x,y)=\begin{pmatrix}x-x_0&y-y_0\end{pmatrix}A\begin{pmatrix}x-x_0\\y-y_0\end{pmatrix}$
term by term gives
\begin{multline}
f(x,y)=ax^2-2axx_0+ax_0^{\,2}+bxy+cxy-bx_0y-cx_0y+\\
+dy^2-bxy_0-cxy_0+bx_0y_0+cx_0y_0-2dyy_0+dy_0^{\,2},
\end{multline}
and the Hessian of this expression is
$$\begin{bmatrix}2a&b+c\\b+c&2d\end{bmatrix}.$$
And this is equal to
$$\begin{bmatrix}a&c\\b&d\end{bmatrix}+\begin{bmatrix}a&b\\c&d\end{bmatrix}=A^\top+A.$$
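A quick symbolic check of this computation (a sketch, assuming SymPy is available):

```python
import sympy as sp

x, y, x0, y0, a, b, c, d = sp.symbols('x y x_0 y_0 a b c d')

A = sp.Matrix([[a, b], [c, d]])
v = sp.Matrix([x - x0, y - y0])

# f(x, y) = (x - x0, y - y0) A (x - x0, y - y0)^T
f = (v.T * A * v)[0, 0]

H = sp.hessian(f, (x, y))
print(sp.simplify(H - (A + A.T)))  # prints the zero matrix
```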