Let $V$ and $W$ be vector spaces over a field $\mathbb{K}$. You (hopefully!) should know that a function $f\colon V\to W$ is a linear transformation if for all $u,v\in V$ and all $\lambda,\mu\in\mathbb{K}$, we have
$$f(\lambda u+\mu v)=\lambda f(u)+\mu f(v).$$
(There are more efficient equivalent definitions, but this one should look familiar.) For example, if $V=\mathbb{R}^2$ and $W=\mathbb{R}$, then the map $\alpha\colon\mathbb{R}^2\to\mathbb{R}$ defined by $\alpha(a,b)=a$ is a linear transformation.
Now for some of the other terms: both isomorphisms and linear functionals are specific types of linear maps. A linear functional is a linear map whose codomain (i.e. $W$, in the notation above) is equal to the field $\mathbb{K}$ (which is in particular a vector space over itself). Our example $\alpha$ from before is a linear functional, because $W=\mathbb{R}$.
A linear transformation is an isomorphism if it is invertible. The map $\alpha$ above is not invertible because it isn't injective. However, the map $\beta\colon\mathbb{R}^2\to\mathbb{R}^2$ defined by $\beta(a,b)=(b,a)$ is an isomorphism (it is in fact its own inverse!). Note, though, that $\beta$ is not a linear functional, because its codomain is not $\mathbb{R}$.
A dual space is entirely different, and is not a type of linear transformation. Given a vector space $V$ over $\mathbb{K}$, the dual space $V^*$ is the set of all linear functionals with domain $V$, i.e. the set of all linear maps $V\to\mathbb{K}$. In fact this is more than a set; it is a vector space over $\mathbb{K}$, under the operations $(f+g)(v)=f(v)+g(v)$ and $(\lambda\cdot f)(v)=\lambda\cdot f(v)$.
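If it helps to see this concretely, here is a tiny numerical sketch (in Python with numpy; the second functional $\gamma$ below is just something I made up for illustration) checking that $\alpha$ is linear and that sums and scalar multiples of functionals are again functionals:

```python
import numpy as np

alpha = lambda v: v[0]             # alpha(a, b) = a, the functional from the text
gamma = lambda v: 2 * v[0] - v[1]  # another functional, purely for illustration

# The dual-space operations from the text: pointwise sum and scalar multiple.
add = lambda f, g: (lambda v: f(v) + g(v))
scale = lambda lam, f: (lambda v: lam * f(v))

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lam, mu = 4.0, -2.0

# Linearity of alpha: alpha(lam*u + mu*v) == lam*alpha(u) + mu*alpha(v).
assert np.isclose(alpha(lam * u + mu * v), lam * alpha(u) + mu * alpha(v))

# The sum and scalar multiple of functionals are again functionals on R^2.
h = add(alpha, scale(lam, gamma))
print(h(u))  # (alpha + 4*gamma)(u) = 1 + 4*(2*1 - 2) = 1.0
```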
I hope this helps clarify the definitions a little.
Edit: You added a subquestion about matrices. I intentionally didn't use matrices anywhere in my answer. One advantage of this is that everything I say works even for infinite-dimensional vector spaces, where matrices don't really work (it is possible to imagine matrices of infinite size, but this isn't necessarily a good idea!). The other reason to avoid them is that to "turn a linear map $V\to W$ into a matrix" requires choosing bases for $V$ and $W$; this choice is arbitrary, and different choices result in different matrices, which can very quickly get confusing.
On the other hand, it is very useful to know how to check (for example) whether a linear map between finite dimensional vector spaces is invertible by choosing some bases to get a matrix representing it, and then doing computations with the matrix.
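For instance, here is a quick sketch of that check for the map $\beta$ from above (using numpy; the choice of the standard basis is mine, and any other basis would do):

```python
import numpy as np

# Matrix of beta(a, b) = (b, a) in the standard basis: columns are beta(e1), beta(e2).
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

print(np.linalg.det(B))                # -1.0: nonzero, so beta is invertible
print(np.allclose(B @ B, np.eye(2)))   # True: beta is its own inverse
```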
One way to motivate dual spaces and transposes is to consider differentiation of scalar-valued functions of several variables. The basic point is that functionals are the easiest functions to deal with short of constant functions, so that differentiation is essentially approximation by a unique functional such that the error in the approximation is sufficiently well behaved. Moreover, transposes arise naturally when differentiating, say, the composition of a scalar-valued function with a change of coordinates.
Let $f : (a,b) \to \mathbb{R}$. Conventionally, one defines $f$ to be differentiable at $x \in (a,b)$ if the limit
$$
\lim_{h \to 0} \frac{f(x+h)-f(x)}{h}
$$
exists, in which case the value of that limit is defined to be the derivative $f^\prime(x)$ of $f$ at $x$. Observe, however, that this definition means that for $h$ small enough that $x+h \in (a,b)$,
$$
f(x+h)-f(x) = f^\prime(x)h + R_x(h),
$$
where $h \mapsto f^\prime(x)h$ defines a linear transformation $df_x : \mathbb{R} \to \mathbb{R}$ approximating $f$ near $x$, and where the error term $R_x(h)$ satisfies
$$
\lim_{h \to 0} \frac{R_x(h)}{h} = 0.
$$
In fact, $f$ is differentiable at $x$ if and only if there exists a linear transformation $T : \mathbb{R} \to \mathbb{R}$ such that
$$
\lim_{h \to 0} \frac{\lvert f(x+h) - f(x) - T(h) \rvert}{\lvert h \rvert} = 0,
$$
in which case $df_x := T$ is unique, and given by multiplication by the scalar $f^\prime(x) = T(1)$.
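As a quick numerical illustration of this error condition (the function $f(x) = x^2$ and the point $x = 1$ are my own choices, not anything canonical):

```python
import numpy as np

# f(x) = x**2 at x = 1, where f'(1) = 2, so df_x(h) = 2*h.
f = lambda x: x ** 2
x, fprime = 1.0, 2.0

for h in [1e-1, 1e-3, 1e-5]:
    R = f(x + h) - f(x) - fprime * h  # the error term R_x(h); here exactly h**2
    print(h, R / h)                   # the ratio R_x(h)/h tends to 0 with h
```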
Now let $f : U \to \mathbb{R}^m$, where $U$ is an open subset of $\mathbb{R}^n$. Then we can still perfectly well define $f$ to be differentiable at $x \in U$ if and only if there exists a linear transformation $T : \mathbb{R}^n \to \mathbb{R}^m$ such that
$$
\lim_{h \to 0} \frac{\| f(x+h) - f(x) - T(h) \|}{\|h\|} = 0,
$$
in which case $df_x := T$ is unique; in particular, for $\|h\|$ small enough,
$$
f(x+h) - f(x) = df_x(h) + R_x(h),
$$
where $df_x$ gives a linear approximation of $f$ near $x$, such that the error term $R_x(h)$ satisfies
$$
\lim_{h \to 0} \frac{R_x(h)}{\|h\|} = 0.
$$
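Here is the same error condition checked numerically in the multivariable case; the map $f$ and its Jacobian below are my own illustrative choices:

```python
import numpy as np

# f(x1, x2) = (x1*x2, x1 + x2**2); the candidate T is the Jacobian of f at x.
f = lambda x: np.array([x[0] * x[1], x[0] + x[1] ** 2])
x = np.array([1.0, 2.0])
J = np.array([[x[1], x[0]],
              [1.0, 2 * x[1]]])  # Jacobian matrix of f at x

for t in [1e-1, 1e-3, 1e-5]:
    h = t * np.array([1.0, -1.0])
    ratio = np.linalg.norm(f(x + h) - f(x) - J @ h) / np.linalg.norm(h)
    print(ratio)  # tends to 0 as h -> 0 (here the ratio is exactly t)
```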
At last, let's specialise to the case where $f : U \to \mathbb{R}$, i.e., where $m=1$. If $f$ is differentiable at $x$, then $df_x : \mathbb{R}^n \to \mathbb{R}$ is linear, and hence $df_x \in (\mathbb{R}^n)^\ast$ by definition. In particular, for any $v \in \mathbb{R}^n$, the directional derivative
$$
\nabla_v f(x) := \lim_{\epsilon \to 0} \frac{f(x+\epsilon v) - f(x)}{\epsilon}
$$
exists and is given by
$$
\nabla_v f(x) = df_x(v).
$$
Moreover, the gradient of $f$ at $x$ is exactly the unique vector $\nabla f(x) \in \mathbb{R}^n$ such that
$$
\forall v \in \mathbb{R}^n, \quad df_x(v) = \langle \nabla f(x), v \rangle.
$$
In any event, the derivative of a scalar-valued function of $n$ variables at a point is most naturally understood as a functional on $\mathbb{R}^n$, i.e., as an element of $(\mathbb{R}^n)^\ast$.
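A small sanity check of the last two displayed identities, with an example function of my own choosing (the gradient is computed by hand, the directional derivative by a finite difference):

```python
import numpy as np

# f(x1, x2) = x1**2 + 3*x1*x2 at x = (1, 2), so grad f(x) = (2*x1 + 3*x2, 3*x1) = (8, 3).
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
x = np.array([1.0, 2.0])
grad = np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

v = np.array([0.5, -1.0])
eps = 1e-6
directional = (f(x + eps * v) - f(x)) / eps  # nabla_v f(x), by its definition
print(directional, grad @ v)                 # both approximately 8*0.5 - 3 = 1.0
```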
Now suppose, for simplicity, that $f : \mathbb{R}^n \to \mathbb{R}$ is everywhere differentiable, and let $S : \mathbb{R}^p \to \mathbb{R}^n$ be a linear transformation, e.g., a change of coordinates $\mathbb{R}^n \to \mathbb{R}^n$. Then $f \circ S$ is everywhere differentiable with derivative $$d(f \circ S)_y = df_{Sy} \circ S = S^t\, df_{Sy}$$ at $y \in \mathbb{R}^p$, where $S^t : (\mathbb{R}^n)^\ast \to (\mathbb{R}^p)^\ast$ is the transpose of $S$, defined by $S^t \varphi = \varphi \circ S$. On the one hand, if $S = 0$, then $f \circ S$ is constant with value $f(0)$, so that $d(f \circ S)_y = 0 = S^t\, df_{Sy}$, as required. On the other hand, if $S \neq 0$, so that
$$
\|S\| := \sup_{k \neq 0} \frac{\|Sk\|}{\|k\|} > 0,
$$
it follows that, for all $k \neq 0$ with $Sk \neq 0$,
$$
\frac{\lvert (f \circ S)(y+k) - (f \circ S)(y) - (df_{Sy} \circ S)(k) \rvert}{\|k\|} = \frac{\lvert f(Sy + Sk) - f(Sy) - df_{Sy}(Sk) \rvert}{\|k\|} \leq \|S\|\,\frac{\lvert f(Sy + Sk) - f(Sy) - df_{Sy}(Sk) \rvert}{\|Sk\|} \to 0, \quad k \to 0,
$$
since $\|Sk\| \leq \|S\|\|k\|$, and the final quotient tends to $0$ by differentiability of $f$ at $Sy$ (note that $Sk \to 0$ as $k \to 0$). When $Sk = 0$, the numerator is $\lvert f(Sy) - f(Sy) - df_{Sy}(0) \rvert = 0$, so the quotient vanishes anyway.
More concretely, once you know that $f \circ S$ is differentiable everywhere, then for each $k \in \mathbb{R}^p$, by linearity of $S$,
$$
(f \circ S)(y + \epsilon k) = f(Sy + \epsilon Sk),
$$
so that, indeed
$$
\left(d(f \circ S)_y\right)(k) = \nabla_k(f \circ S)(y) = \nabla_{Sk}f(Sy) = df_{Sy}(Sk) = (S^t\, df_{Sy})(k).
$$
In general, if $S : \mathbb{R}^p \to \mathbb{R}^n$ is everywhere differentiable (again, for simplicity), then
$$
d(f \circ S)_y = df_{Sy} \circ dS_y = (dS_y)^t\, df_{Sy},
$$
which is none other than the relevant case of the chain rule.
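If you want to see the transpose doing its work numerically, here is a sketch of the linear case $d(f \circ S)_y = S^t\, df_{Sy}$, written in gradient form as $\nabla(f \circ S)(y) = S^t\, \nabla f(Sy)$; all the concrete choices of $f$, $S$, and $y$ below are mine:

```python
import numpy as np

# f: R^2 -> R with its gradient computed by hand.
f = lambda x: x[0] ** 2 + np.sin(x[1])
grad_f = lambda x: np.array([2 * x[0], np.cos(x[1])])

S = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])  # a linear map R^3 -> R^2
y = np.array([0.3, -0.7, 1.1])

# Left side: numerical gradient of f o S at y, one coordinate at a time.
eps = 1e-6
lhs = np.array([(f(S @ (y + eps * e)) - f(S @ y)) / eps for e in np.eye(3)])
rhs = S.T @ grad_f(S @ y)  # right side: the transpose acting on df_{Sy}
print(np.allclose(lhs, rhs, atol=1e-4))  # True
```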
Best Answer
In the special case of matrix algebra, this turns out to be fairly obvious.
In this setting, one usually writes vectors as $n \times 1$ matrices ("column vectors"), and linear functionals as $1 \times n$ matrices ("row vectors").
If we have a matrix $A$ of suitable dimensions, then "multiplication on the left" gives a linear transformation $T$ defined by $T(v) = Av$. The dual transformation is "multiplication on the right": $T^*(u) = uA$. So your identity is merely
$$ (uA)v = u(Av). $$
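You can confirm this numerically in a couple of lines (the shapes below are arbitrary choices of mine):

```python
import numpy as np

# u a row vector (1 x n), A an n x n matrix, v a column vector (n x 1).
rng = np.random.default_rng(0)
u = rng.standard_normal((1, 3))
A = rng.standard_normal((3, 3))
v = rng.standard_normal((3, 1))

print(np.allclose((u @ A) @ v, u @ (A @ v)))  # True: associativity of the matrix product
```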