Here is an algebraic approach to adjoint operators. Let us strip away the inner product and instead take two vector spaces $V$ and $W$. Furthermore, let $V^*$ and $W^*$ be the linear duals of $V$ and $W$, that is, the spaces of linear maps $V\to k$ and $W\to k$, where $k$ is the base field. If you're working over $\mathbb R$ or $\mathbb C$, or some other topological field, you might want to work with continuous linear maps between topological vector spaces.
Given a linear operator $A: V\to W$, we can define a dual map $A^*: W^* \to V^*$ by $(A^*(\phi))(v)=\phi(A(v))$. It is straightforward to verify that this gives a well-defined linear map between the dual spaces. This dual map is the adjoint of $A$. For most sensible choices of dual topologies, this map should also be continuous.
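For a concrete sanity check, here is a minimal finite-dimensional sketch in Python (the dimensions and random test data are illustrative, not anything from the original question). Functionals are represented as row arrays, so the dual map is simply "compose with $A$", which in coordinates is right-multiplication by the matrix of $A$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))    # a linear map A : R^3 -> R^2
phi = rng.standard_normal((1, 2))  # a functional phi in W* = (R^2)*

def dual_map(A, phi):
    """The dual map A* : W* -> V*, defined by (A* phi)(v) = phi(A v)."""
    return phi @ A                 # a row array, i.e. a functional on R^3

# Check the defining identity (A* phi)(v) = phi(A v) on a test vector.
v = rng.standard_normal((3, 1))
assert np.allclose(dual_map(A, phi) @ v, phi @ (A @ v))
```

In particular, if functionals are written as column vectors instead, $A^*$ is given by the transposed matrix $A^T$, which is why the dual map is sometimes called the transpose of $A$.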
The question is, how does this relate to what you are doing with inner products? Giving an inner product on $V$ is the same as giving an isomorphism between $V$ and $V^*$ as follows:
Given an inner product $\langle x, y \rangle$, we can define a map $V\to V^*$ via $x\mapsto \langle x, - \rangle$. This map is injective by nondegeneracy, and hence an isomorphism when $V$ is finite dimensional. Similarly, given an isomorphism $\phi:V\to V^*$, we can define an inner product by $\langle x,y\rangle =\phi(x)(y)$. The "inner products" coming from isomorphisms will not in general be symmetric, and so they are better called bilinear forms, but we don't need to concern ourselves with this difference.
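As a small illustration, assume $V=\mathbb{R}^3$ and write the inner product as $\langle x,y\rangle = x^TGy$ for a symmetric positive definite Gram matrix $G$ (an illustrative choice, not anything from the original question); then the isomorphism is completely explicit:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
G = M.T @ M + 3 * np.eye(3)  # symmetric positive definite, hence nondegenerate

def phi(x):
    """The map V -> V*: x |-> <x, ->, represented as the row array x^T G."""
    return G @ x             # with 1-D arrays, phi(x) @ y computes <x, y>

x, y = rng.standard_normal(3), rng.standard_normal(3)
assert np.isclose(phi(x) @ y, x @ G @ y)           # phi(x)(y) = <x, y>
assert np.allclose(np.linalg.solve(G, phi(x)), x)  # invertible: G^{-1} recovers x
```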
So let $\langle x,y \rangle$ be an inner product on $V$, and let $\varphi:V\to V^*$ be the corresponding isomorphism defined above. Then given $A:V\to V$, we have a dual map $A^*:V^* \to V^*$. However, we can use our isomorphism to pull this back to a different dual map (often also denoted $A^*$, but which we will denote by $A^{\dagger}$ to prevent confusion), defined by $A^{\dagger}(v)=\varphi^{-1}(A^*(\varphi(v)))$. This is the adjoint that you are using.
Let us see why. In what follows, $x\in V, f\in V^*$. Note that $\langle x, \varphi^{-1} f \rangle = f(x)$ and so we have
$$ \langle Ax, \varphi^{-1}f \rangle = f(Ax)=(A^*f)(x)=\langle x, \varphi^{-1}(A^* f) \rangle $$
Now, let $y=\varphi^{-1}f$, so that $\varphi(y)=f$. Then we can rewrite the first and last terms of the above equality as
$$\langle Ax, y \rangle = \langle x, \varphi^{-1}(A^* \varphi(y)) \rangle = \langle x, A^{\dagger}y \rangle $$
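Here is a numerical sketch of this identity, again assuming $\langle x,y\rangle = x^TGy$ on $\mathbb{R}^3$ with an illustrative positive definite $G$. Unwinding the definitions in coordinates gives $A^{\dagger} = G^{-1}A^TG$, which reduces to the familiar $A^{\dagger}=A^T$ when $G=I$:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
G = M.T @ M + 3 * np.eye(3)      # symmetric positive definite Gram matrix
A = rng.standard_normal((3, 3))  # an arbitrary operator A : V -> V

inner = lambda x, y: x @ G @ y   # <x, y> = x^T G y

# A-dagger = phi^{-1} o A* o phi, which in coordinates is G^{-1} A^T G.
A_dagger = np.linalg.solve(G, A.T @ G)

# Verify the adjoint identity <Ax, y> = <x, A-dagger y> on test vectors.
x, y = rng.standard_normal(3), rng.standard_normal(3)
assert np.isclose(inner(A @ x, y), inner(x, A_dagger @ y))
```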
Best Answer
Note that for two diagonal matrices $A=\operatorname{diag}(\lambda_1,\ldots,\lambda_n)$ and $B=\operatorname{diag}(\mu_1,\ldots,\mu_n)$, you get that $\operatorname{tr} A^T B = \sum\lambda_i\mu_i = \left<v,w\right>$, where $v=(\lambda_1,\ldots,\lambda_n)$ and $w=(\mu_1,\ldots,\mu_n)$.
Choose a basis $B$ and consider the set $V_B$ of all matrices which are diagonal with respect to this basis. It is easy to see that $V_B$ is a subspace (and even a subalgebra), and that it is isometric to $\mathbb{R}^n$ by identifying the vector $v$ with the matrix $\operatorname{diag}(v)$, which has the entries of $v$ on its diagonal. The inner product you described (which is called the Hilbert-Schmidt inner product, by the way) is then identified with the usual inner product on $\mathbb{R}^n$.
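A quick numerical check of this identification (dimension and test data illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
lam, mu = rng.standard_normal(4), rng.standard_normal(4)
A, B = np.diag(lam), np.diag(mu)  # diagonal matrices diag(lam), diag(mu)

# On V_B, the Hilbert-Schmidt product tr(A^T B) is the usual dot product.
assert np.isclose(np.trace(A.T @ B), lam @ mu)
```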
This works for any choice of basis. Recalling that two diagonalizable matrices commute if and only if there is a basis in which both are simultaneously diagonal, we can say that this inner product is a generalization to the non-commuting case. Or rather, that the inner product on $\mathbb{R}^n$ is the Hilbert-Schmidt product restricted to the commuting case.
This still leaves open the question: why is one of the matrices transposed? We could just as well have defined the pairing to be $\operatorname{tr} AB$. Off the top of my head, I can come up with three justifications for this choice:
The inner product should remain the same if we apply the same (orthogonal) basis transformation to both matrices. If $O$ is an orthogonal matrix from the basis $B$ to the basis $B'$, then we already know that in $\mathbb{R}^n$ it holds that $\left<Ov,Ow\right>=\left<v,w\right>$ for any two vectors $v,w$. In the matrix space this manifests in the fact that the map $M\mapsto OMO^t$ is an isometry from $V_B$ to $V_{B'}$. In general, we still want the property $\left<OMO^t,ONO^t\right>=\left<M,N\right>$ for any two matrices $M,N$, and the Hilbert-Schmidt inner product achieves that.
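To spell out the last claim, expand the definition and use the invariance of the trace under cyclic permutations, together with $O^tO=I$:
$$\left<OMO^t,ONO^t\right> = \operatorname{tr}\big((OMO^t)^t\,ONO^t\big) = \operatorname{tr}\big(OM^tNO^t\big) = \operatorname{tr}\big(M^tNO^tO\big) = \operatorname{tr}\big(M^tN\big) = \left<M,N\right>.$$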
In the complex case, the Hilbert-Schmidt inner product becomes $\left<A,B\right> = \operatorname{tr} B^* A$, where $*$ denotes the conjugate transpose (whether the $*$ goes on $A$ or on $B$ is a matter of convention; the real case is likewise usually defined as $\left<A,B\right> = \operatorname{tr} B^t A$). Note that this restricts to the usual inner product $\left<u,v\right> = \sum u_i\bar{v_i}$ on diagonal matrices. Still, this alone doesn't explain much, because we could just as well have defined the product to be $\operatorname{tr} \bar{A}B$, i.e., conjugated without transposing.

The reason I went to the complex case is as follows: any inner product on $\mathbb{C}^n$ is of the form $\left<v,w\right> = w^* P v$ for some positive definite matrix $P$ (depending on the basis). While this is also true over $\mathbb{R}$ (with transposition rather than conjugation), over $\mathbb{C}$ the matrix $P$ necessarily has a square root $Q$ satisfying $Q^2=P$. This allows us to define the product $\left<A,B\right>_P = \left<AQ,BQ\right>$, and it is not hard to prove that this product coincides with the product $\left<v,w\right> = w^* P v$ in the commuting case. This means that $M\mapsto MQ$ lets us extend the operation of "twisting" the inner product by $P$ to the space of matrices (that is to say, it nicely defines an isometry between the inner product structure induced on the matrix space by $\left<v,w\right>=w^*v$ and the one induced by $\left<v,w\right>=w^*Pv$). But this only works if the conjugation is there.
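And a numerical sanity check of the commuting (diagonal) case, with an illustrative diagonal positive definite $P$ and its square root $Q$:

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.standard_normal(3) + 1j * rng.standard_normal(3)
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)
p = rng.uniform(1, 2, size=3)           # positive eigenvalues of P
P, Q = np.diag(p), np.diag(np.sqrt(p))  # Q^2 = P

hs = lambda A, B: np.trace(B.conj().T @ A)  # <A, B> = tr(B* A)
A, B = np.diag(v), np.diag(w)

# <A, B>_P = <AQ, BQ> agrees with <v, w> = w* P v on diagonal matrices.
assert np.isclose(hs(A @ Q, B @ Q), w.conj() @ P @ v)
```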
This leaves open the small detail of the difference between defining the product as $\operatorname{tr} B^t A$ and as $\operatorname{tr} AB^t$. I'll leave it to you to contemplate what difference this makes, if any (I suggest you consider the non-square case).