Let $V$ be a vector space over any field; for concreteness I will take the field
to be $\mathbb R$ from now on, though everything is just as interesting in the
general case. Certainly one of the interesting concepts
in linear algebra is that of a hyperplane in $V$.
For example, if $V = \mathbb R^n$, then a hyperplane is just the solution set to an equation
of the form
$$a_1 x_1 + \cdots + a_n x_n = b,$$
for some $a_i$ not all zero and some $b$.
Recall that solving such equations (or simultaneous sets of such equations) is one
of the basic motivations for developing linear algebra.
Now remember that when a vector space is not given to you as $\mathbb R^n$,
it doesn't normally have a canonical basis, so we don't have a canonical way
to write its elements down via coordinates, and so we can't describe hyperplanes
by explicit equations like the one above. (Or rather, we can, but only after choosing
coordinates, and this choice is not canonical.)
How can we canonically describe hyperplanes in $V$?
For this we need a conceptual interpretation of the above equation. And here linear
functionals come to the rescue. More precisely, the map
$$\begin{align*}
\ell: \mathbb{R}^n &\rightarrow \mathbb{R} \\
(x_1,\ldots,x_n) &\mapsto a_1 x_1 + \cdots + a_n x_n
\end{align*}$$
is a linear functional on $\mathbb R^n$, and so the above equation for the
hyperplane can be written as
$$\ell(v) = b,$$
where $v = (x_1,\ldots,x_n).$
More generally, if $V$ is any vector space, and $\ell: V \to \mathbb R$ is any
non-zero linear functional (i.e. non-zero element of the dual space), then
for any $b \in \mathbb R,$ the set
$$\{v \, | \, \ell(v) = b\}$$
is a hyperplane in $V$, and all hyperplanes in $V$ arise this way.
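To make this concrete, here is a small numerical sketch (using numpy, with an illustrative choice of functional $\ell$ and constant $b$) of a hyperplane as the level set $\{v \mid \ell(v) = b\}$:

```python
import numpy as np

# An illustrative linear functional on R^3, given by coefficients a = (1, 2, 3):
a = np.array([1.0, 2.0, 3.0])
ell = lambda v: a @ v   # ell(v) = a_1 x_1 + a_2 x_2 + a_3 x_3

b = 6.0

# One point on the hyperplane {v : ell(v) = b}, and one off it:
on_plane = np.array([1.0, 1.0, 1.0])    # 1 + 2 + 3 = 6
off_plane = np.array([0.0, 0.0, 1.0])   # 3 != 6

print(ell(on_plane) == b)    # True
print(ell(off_plane) == b)   # False
```

The point is that the hyperplane is described with no reference to coordinates beyond the functional $\ell$ itself.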
So this gives a reasonable justification for introducing the elements of the dual
space to $V$; they generalize the notion of linear equation in several variables
from the case of $\mathbb R^n$ to the case of an arbitrary vector space.
Now you might ask: why do we make them a vector space themselves? Why do we want
to add them to one another, or multiply them by scalars?
There are lots of reasons for this; here is one: Remember how important it is,
when you solve systems of linear equations, to add equations together, or
to multiply them by scalars (here I am referring to all the steps you typically
make when performing Gaussian elimination on a collection of simultaneous linear
equations)? Well, under the dictionary above between linear equations
and linear functionals, these processes correspond precisely to adding together
linear functionals, or multiplying them by scalars. If you ponder this for a bit,
you can hopefully convince yourself that making the set of linear
functionals a vector space is a pretty natural thing to do.
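The dictionary between row operations and dual-space arithmetic can be sketched numerically; here each "equation" is encoded by its coefficient vector, i.e. by the corresponding functional (the particular equations are just an illustration):

```python
import numpy as np

# Two "equations" on R^2, encoded as coefficient vectors (i.e. functionals):
ell1 = np.array([1.0, 1.0])   # x + y
ell2 = np.array([2.0, 1.0])   # 2x + y

# The elimination step "replace equation 2 by (equation 2) - 2*(equation 1)"
# is exactly the linear combination ell2 - 2*ell1 in the dual space:
ell3 = ell2 - 2.0 * ell1      # -> (0, -1), i.e. the equation -y = ...

# On any vector v, the combined functional acts as the corresponding
# combination of evaluations:
v = np.array([3.0, 5.0])
print(ell3 @ v == ell2 @ v - 2.0 * (ell1 @ v))   # True
```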
Summary: just as concrete vectors $(x_1,\ldots,x_n) \in \mathbb R^n$ are naturally
generalized to elements of vector spaces, concrete linear expressions
$a_1 x_1 + \cdots + a_n x_n$ in $x_1,\ldots, x_n$ are naturally generalized to linear functionals.
The exactness of the first sequence means that $S$ is injective, $T$ is surjective, and the range of $S$ is exactly the kernel of $T$.
Okay, so to show that the second sequence is exact, we'll start by showing that $\circ T$ is injective. Let $g, g'$ be elements of $W^{*}$, and suppose that $g \circ T = g' \circ T$. Since $T$ is surjective, for any $w \in W$ there is some $v \in V$ with $T(v) = w$. Then $g(T(v)) = g'(T(v))$, i.e. $g(w) = g'(w)$, so $g$ and $g'$ agree on every element of $W$ and hence are the same element of $W^*$.
Next, we'll show that $\circ S$ is surjective. Let $h$ be an arbitrary element of $U^*$. We want to produce an element $f \in V^*$ such that $f \circ S = h$. It is enough to define $f$ on the range of $S$, since a linear functional on a subspace can always be extended to a linear functional on all of $V$. On the range of $S$, define $f$ to be $h \circ S^{-1}$; this makes sense because $S$ is injective, hence a bijection onto its range. Then $f \circ S = h \circ S^{-1} \circ S = h$, proving surjectivity of $\circ S$.
I'll leave it to you to verify exactness at $V^*$, i.e. that the range of $\circ T$ equals the kernel of $\circ S$, using the techniques outlined in the prior steps.
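In finite dimensions the dual maps are just transposes, so the whole argument can be checked on a toy example; here is a sketch with an illustrative short exact sequence $0 \to \mathbb{R} \to \mathbb{R}^2 \to \mathbb{R} \to 0$:

```python
import numpy as np

# An illustrative short exact sequence 0 -> R -> R^2 -> R -> 0:
S = np.array([[1.0],
              [0.0]])        # S: R -> R^2, inclusion of the first axis
T = np.array([[0.0, 1.0]])   # T: R^2 -> R, projection onto the second axis

# Exactness: S injective, T surjective, range(S) = ker(T).
assert np.linalg.matrix_rank(S) == 1   # S injective
assert np.linalg.matrix_rank(T) == 1   # T surjective

# Dualizing replaces each map by its transpose (precomposition):
St, Tt = S.T, T.T   # St plays the role of (. o S), Tt of (. o T)

assert np.linalg.matrix_rank(Tt) == 1  # (. o T) injective
assert np.linalg.matrix_rank(St) == 1  # (. o S) surjective onto R^* (rank 1)

# Exactness at (R^2)^*: St @ Tt = (T @ S)^t = 0, and the ranks add up
# to dim (R^2)^* = 2, so range(Tt) = ker(St):
assert np.allclose(St @ Tt, 0.0)
print("dual sequence exact")
```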
One way to motivate dual spaces and transposes is to consider differentiation of scalar-valued functions of several variables. The basic point is that functionals are the easiest functions to deal with short of constant functions, so that differentiation is essentially approximation by a unique functional such that the error in the approximation is sufficiently well behaved. Moreover, transposes arise naturally when differentiating, say, the composition of a scalar-valued function with a change of coordinates.
Let $f : (a,b) \to \mathbb{R}$. Conventionally, one defines $f$ to be differentiable at $x \in (a,b)$ if the limit $$ \lim_{h \to 0} \frac{f(x+h)-f(x)}{h} $$ exists, in which case the value of that limit is defined to be the derivative $f^\prime(x)$ of $f$ at $x$. Observe, however, that this definition means that for $h$ small enough, $$ f(x+h)-f(x) = f^\prime(x)h + R_x(h), $$ where $h \to f^\prime(x)h$ defines a linear transformation $df_x :\mathbb{R} \to \mathbb{R}$ approximating $f$ near $x$, and where the error term $R_x(h)$ satisfies $$ \lim_{h \to 0} \frac{R_x(h)}{h} = 0. $$ In fact, $f$ is differentiable at $x$ if and only if there exists a linear transformation $T : \mathbb{R} \to \mathbb{R}$ such that $$ \lim_{h \to 0} \frac{\lvert f(x+h) - f(x) - T(h) \rvert}{\lvert h \rvert} = 0, $$ in which case $df_x := T$ is unique, and given by multiplication by the scalar $f^\prime(x) = T(1)$.
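A quick numerical sanity check of this characterization, for the illustrative choice $f(x) = x^2$ at $x = 1$ (where $f'(1) = 2$ and in fact $R_x(h) = h^2$):

```python
# Check that the error term R_x(h) = f(x+h) - f(x) - f'(x) h
# satisfies R_x(h)/h -> 0, for f(x) = x**2 at x = 1:
f = lambda x: x**2
x, fprime = 1.0, 2.0

for h in [1e-1, 1e-3, 1e-5]:
    R = f(x + h) - f(x) - fprime * h   # here R = h**2 exactly
    print(h, R / h)                     # R/h = h, tending to 0
```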
Now, let $f : U \to \mathbb{R}^m$, where $U$ is an open subset of $\mathbb{R}^n$. Then, we can perfectly well define $f$ to be differentiable at $x \in U$ if and only if there exists a linear transformation $T : \mathbb{R}^n \to \mathbb{R}^m$ such that $$ \lim_{h \to 0} \frac{\| f(x+h) - f(x) - T(h) \|}{\|h\|} = 0, $$ in which case $df_x := T$ is unique; in particular, for $\|h\|$ small enough, $$ f(x+h) - f(x) = df_x(h) + R_x(h), $$ where $df_x$ gives a linear approximation of $f$ near $x$, such that the error term $R_x(h)$ satisfies $$ \lim_{h \to 0} \frac{R_x(h)}{\|h\|} = 0. $$
At last, let's specialise to the case where $f : U \to \mathbb{R}$, i.e., where $m=1$. If $f$ is differentiable at $x$, then $df_x : \mathbb{R}^n \to \mathbb{R}$ is linear, and hence $df_x \in (\mathbb{R}^n)^\ast$ by definition. In particular, for any $v \in \mathbb{R}^n$, the directional derivative $$ \nabla_v f(x) := \lim_{\epsilon \to 0} \frac{f(x+\epsilon v) - f(x)}{\epsilon} $$ exists and is given by $$ \nabla_v f(x) = df_x(v). $$ Moreover, the gradient of $f$ at $x$ is exactly the unique vector $\nabla f(x) \in \mathbb{R}^n$ such that $$ \forall v \in \mathbb{R}^n, \quad df_x(v) = \langle \nabla f(x), v \rangle. $$ In any event, the derivative of a scalar-valued function of $n$ variables at a point is most naturally understood as a functional on $\mathbb{R}^n$, i.e., as an element of $(\mathbb{R}^n)^\ast$.
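The identity $\nabla_v f(x) = \langle \nabla f(x), v \rangle$ can be checked numerically; here is a sketch for the illustrative choice $f(x) = x \cdot x$, whose gradient at $x$ is $2x$:

```python
import numpy as np

# Compare the directional derivative of f(x) = x . x (gradient 2x)
# with the inner product <grad f(x), v>:
f = lambda x: x @ x
x = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
grad = 2.0 * x

eps = 1e-6
directional = (f(x + eps * v) - f(x)) / eps   # finite-difference approximation
print(directional, grad @ v)                   # both close to 2
```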
Now, suppose, for simplicity, that $f : \mathbb{R}^n \to \mathbb{R}$ is everywhere-differentiable, and let $S : \mathbb{R}^p \to \mathbb{R}^n$ be a linear transformation, e.g., a coordinate change $\mathbb{R}^n \to \mathbb{R}^n$. Then $f \circ S$ is indeed everywhere differentiable with derivative $$d_y(f \circ S) = (d_{Sy} f) \circ S = S^t d_{Sy} f,$$ at $y \in \mathbb{R}^p$. On the one hand, if $S = 0$, then $f \circ S = f(0)$ is constant, so that $d_y(f \circ S) = 0 = S^t d_{Sy} f$, as required. On the other hand, if $S \neq 0$, so that $$ \|S\| := \sup_{k \neq 0} \frac{\|Sk\|}{\|k\|} > 0, $$ it follows that $$ \frac{\lvert (f \circ S)(y+k) - (f \circ S)(y) - (d_{Sy} f \circ S)(k) \rvert}{\|k\|} = \frac{\lvert f(Sy + Sk) - f(Sy) - d_{Sy}f(Sk) \rvert}{\|k\|} \leq \|S\|\frac{\lvert f(Sy + Sk) - f(Sy) - d_{Sy}f(Sk) \rvert}{\|Sk\|} \to 0, \quad k \to 0, $$ by differentiability of $f$ at $Sy$ (when $Sk = 0$ the numerator vanishes, so the bound holds trivially). More concretely, once you know that $f \circ S$ is differentiable everywhere, then for each $k \in \mathbb{R}^p$, by linearity of $S$, $$ (f \circ S)(y + \epsilon k) = f(Sy + \epsilon Sk), $$ so that, indeed, $$ \left(d_y(f \circ S)\right)(k) = \nabla_k(f \circ S)(y) = \nabla_{Sk}f(Sy) = (d_{Sy}f)(Sk) = (S^t d_{Sy}f)(k). $$ In general, if $S : \mathbb{R}^p \to \mathbb{R}^n$ is everywhere differentiable (again, for simplicity), then $$ d_y (f \circ S) = (d_{Sy}f) \circ d_y S = (d_y S)^t d_{Sy}f, $$ which is none other than the relevant case of the chain rule.
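The transpose formula $d_y(f \circ S) = S^t\, d_{Sy} f$ can be verified numerically as well; here is a sketch with an illustrative linear $S : \mathbb{R}^2 \to \mathbb{R}^3$ and $f(x) = x \cdot x$:

```python
import numpy as np

# Verify d_y(f o S) = S^t (d_{Sy} f) for linear S and f(x) = x . x,
# whose gradient at x is 2x:
f = lambda x: x @ x
S = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])   # S: R^2 -> R^3
y = np.array([0.5, -0.25])

grad_fS = S.T @ (2.0 * (S @ y))   # the claimed gradient of f o S at y

# Compare with a finite-difference gradient of f o S:
eps = 1e-6
fd = np.array([(f(S @ (y + eps * e)) - f(S @ y)) / eps
               for e in np.eye(2)])
print(np.allclose(grad_fS, fd, atol=1e-4))   # True
```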