$\def\vect{\mathbf}
\def\diag{{\rm{diag}}}
\def\R{\mathbb R}
\def\vol{{\rm vol}}
\def\sign{{\rm sign}}
$
There are two ideas involved. The first is that the uniqueness of Lebesgue measure as a translation-invariant Borel measure implies that every invertible linear transformation scales it by a constant factor, and in particular that orthogonal transformations preserve it. The second is a multiplicative decomposition of a linear transformation; several would work, and I will use the singular value decomposition.
Theorem. Lebesgue measure $\lambda$ on $\mathbb R^n$ is, up to scaling by a positive constant, the unique Borel measure that is translation invariant and locally finite (i.e. the measure of compact sets is finite).
This is contained in Rudin, Real and Complex Analysis, 3rd edition, Theorem 2.20.
Corollary.
(1) If $T$ is any invertible linear transformation of $\mathbb R^n$, then there exists a positive constant $c_T$ such that $\lambda(T(E)) = c_T\,\lambda(E)$ for all Borel sets $E$.
(2) If $S$, $T$ are invertible linear transformations, then $c_{ST} = c_S c_T$.
(3) If $U$ is an orthogonal linear transformation, then $c_U = 1$.
Proof. For (1), note that $E \mapsto \lambda(T(E))$ is a translation-invariant, locally finite Borel measure, so by the Theorem it is a constant multiple $c_T$ of $\lambda$. Part (2) is immediate. For part (3), it suffices to find a Borel set $B$ such that $0 < \lambda(B) < \infty$ and $U(B) = B$, since then $c_U\,\lambda(B) = \lambda(U(B)) = \lambda(B)$. The closed unit ball $B = \{x : \|x\| \le 1\}$ will do.
Thus Lebesgue measure is invariant under orthogonal transformations as well as under translations.
Lemma (Singular value decomposition). For any invertible matrix $A$, there exist two orthogonal matrices $W$, $V$ and a diagonal matrix $D = \diag(a_1, \dots, a_n)$, with $a_i > 0$, such that $A = W D V$.
Remark: One can easily derive this from the polar decomposition and vice versa.
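For concreteness, here is a minimal numerical sketch of the Lemma in Python (NumPy assumed available): `np.linalg.svd` produces exactly such a factorization for a generic matrix, and the product of the singular values equals $|\det A|$, which is the fact the Corollary below rests on.

```python
# A minimal numerical sketch of the Lemma (NumPy assumed available).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # a random matrix, almost surely invertible

W, a, Vt = np.linalg.svd(A)       # factorization A = W @ diag(a) @ Vt
D = np.diag(a)

assert np.allclose(W @ W.T, np.eye(4))    # W is orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(4))  # V is orthogonal
assert np.all(a > 0)                      # singular values are positive
assert np.allclose(A, W @ D @ Vt)         # A = W D V

# The fact used in the Corollary below: |det A| = prod of singular values.
assert np.isclose(abs(np.linalg.det(A)), np.prod(a))
print("SVD checks passed")
```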
Corollary. For any invertible $A$, $c_A = |\det(A)|$.
Proof. Write $A = W D V$, as in the Lemma, with $D = \diag(a_1, \dots, a_n)$.
Then $|\det(A)| = \prod_i a_i$. On the other hand,
$c_A = c_W c_D c_V = c_D$. Since the image of the unit hypercube under $D$ is a rectangular solid with edge lengths $a_1, \dots, a_n$, it follows that
$c_A = c_D = \prod_i a_i = |\det(A)|$.
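The proof is conceptual, but the conclusion can be sanity-checked numerically. The following rough Monte Carlo sketch (Python with NumPy; illustrative only, not part of the argument) estimates $\lambda(A([0,1]^3))$ by sampling a bounding box and testing membership through the preimage under $A$, and compares the estimate to $|\det A|$.

```python
# Monte Carlo check that a linear map scales volume by |det| (illustrative).
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))

# Bounding box of A([0,1]^3): it contains the images of the 8 cube corners,
# and the image is their convex hull, so the box contains the whole image.
corners = A @ np.array(np.meshgrid([0, 1], [0, 1], [0, 1])).reshape(3, -1)
lo, hi = corners.min(axis=1), corners.max(axis=1)

N = 200_000
pts = lo + (hi - lo) * rng.random((N, 3))   # uniform samples in the box
t = np.linalg.solve(A, pts.T)               # preimages under A
inside = np.all((t >= 0) & (t <= 1), axis=0)

estimate = np.prod(hi - lo) * inside.mean()
print(estimate, abs(np.linalg.det(A)))      # agree up to Monte Carlo error
```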
Lemma. The Lebesgue measure of an affine hyperplane is zero.
Proof. By translation and orthogonal invariance, it suffices to consider the coordinate hyperplane perpendicular to $\vect e_n$. Moreover, by countable additivity it suffices to show that any bounded subset $K$ of this hyperplane has measure zero. But for every $\varepsilon > 0$, $K$ is contained in a rectangular solid of thickness $\varepsilon$ in the $\vect e_n$ direction, and hence of arbitrarily small measure.
Corollary. Let $v_1, \dots, v_n \in \R^n$ be given and let $P$ be the parallelepiped spanned by $v_1, \dots, v_n$. Then $\lambda(P) = |\det(v_1, \dots, v_n)|$. Moreover, the signed volume of $P$ is $\det(v_1, \dots, v_n)$.
Proof. If the $v_i$ are linearly dependent, then $P$ has measure zero since $P$ is contained in a hyperplane, and the determinant is zero as well. Otherwise, let $A$ be the matrix $(v_1, \dots, v_n)$ with columns $v_i$. Then $P$ is the image of the unit hypercube under $A$, so
$\lambda(P) = c_A = |\det(A)|$. The last statement follows from the definition of signed volume, namely
$$
\vol(v_1, \dots, v_n) = \sign(\det(v_1, \dots, v_n)) \lambda(P) = \det(v_1, \dots, v_n).
$$
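A short numerical illustration of the Corollary (Python with NumPy, made-up example vectors): the determinant of the matrix of edge vectors gives the signed volume, its absolute value gives the Lebesgue measure, and swapping two edge vectors flips the orientation and hence the sign.

```python
# Signed and unsigned volume of a parallelepiped from its edge vectors.
import numpy as np

v1 = np.array([1., 0., 0.])
v2 = np.array([1., 2., 0.])
v3 = np.array([0., 1., 3.])
A = np.column_stack([v1, v2, v3])   # edge vectors as columns

signed_vol = np.linalg.det(A)       # signed volume, orientation-sensitive
print(signed_vol, abs(signed_vol))  # 6.0 and 6.0 for these vectors

# Swapping two edge vectors reverses orientation, flipping the sign:
print(np.linalg.det(np.column_stack([v2, v1, v3])))   # -6.0
```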
Remark: Occasionally one sees an explanation for the scaling of Lebesgue measure or for the formula for the Lebesgue measure of a parallelepiped which invokes the change of variable formula for integration. But these explanations are circular, as the conceptual basis for the change of variable formula is the local scaling of Lebesgue measure, which depends on the global scaling of Lebesgue measure under a linear transformation.
We can actually make further reductions. Suppose $T:\Bbb{R}^n\to\Bbb{R}^n$ is the linear transformation
\begin{align}
T(x_1,\dots, x_n)&=(x_1+x_2,x_2,\dots, x_n).
\end{align}
In terms of matrices, we're taking the second row of the identity matrix and adding it to the first row. We only need to restrict attention to this particular transformation, because the general operation "add $c$ times row $j$ to row $i$" reduces to it: row swaps ensure we're only looking at rows $1$ and $2$, and a scalar multiplication of a row ensures we only deal with $c=1$. Now, writing $Q=[0,1]^n$ for the unit cube,
\begin{align}
T(Q)&=\{\xi\in\Bbb{R}^n\,:\, \xi_2\leq \xi_1\leq \xi_2+1\,\quad\text{and}\quad \xi_2,\dots, \xi_n\in [0,1]\}
\end{align}
(i.e. just put $\xi_1=x_1+x_2$ and $\xi_j=x_j$ for $j\geq 2$, and rewrite the inequalities $x_i\in [0,1]$ for all $i$ in terms of $\xi$). Now, we have
\begin{align}
\text{vol}(T(Q))&=\int_{T(Q)}1\,dV\\
&=\int_{[0,1]^{n-2}}\int_0^1\int_{\xi_2}^{\xi_2+1}1\,d\xi_1\,d\xi_2\, d(\xi_3,\dots, \xi_n)\tag{by Fubini}\\
&=\int_0^1(\xi_2+1-\xi_2)\,d\xi_2\\
&=1.
\end{align}
Here, it's clear that the integral over the last $n-2$ coordinates is trivially $1$ (this is just the $(n-2)$-dimensional volume of the cube $[0,1]^{n-2}$).
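For what it's worth, the same computation can be reproduced symbolically in Python (SymPy assumed available); the inner integral is identically $1$, so the whole volume comes out to $1$.

```python
# Reproducing the Fubini computation above symbolically (SymPy assumed
# available; the integral over the last n-2 coordinates contributes 1).
import sympy as sp

xi1, xi2 = sp.symbols('xi1 xi2')
inner = sp.integrate(1, (xi1, xi2, xi2 + 1))  # inner integral: equals 1
vol = sp.integrate(inner, (xi2, 0, 1))        # remaining integral over xi2
print(vol)                                    # 1
```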
Best Answer
You need three vectors to determine a parallelepiped. This is because one of the vertices is at $0$, so if you know the positions of the remaining $3$ "nearest" vertices, the rest is determined by the parallelism of the faces. I will refer to these as the "defining vectors".
So let's say we have the three vectors $\textbf{a}, \textbf{b}, \textbf{c}$. Then is the parallelepiped precisely the set $\{\textbf{a}, \textbf{b}, \textbf{c}\}$? Clearly not, because $\{\textbf{a}, \textbf{b}, \textbf{c}\}$ is just a collection of three vectors; it's not a parallelepiped. What you need is the entire solid region, every point it contains. For example, you need the point at the origin: $0 = 0\textbf{a}+ 0\textbf{b}+ 0\textbf{c}$. How about the vertex on the "base" furthest to the right? You can see that that point is given by $1\textbf{b}+ 1\textbf{c}$. What about the centre of the parallelepiped? That point is $0.5\textbf{a}+0.5 \textbf{b}+0.5 \textbf{c}$. In this way, we see that the \textit{collection of all points in the parallelepiped} is given by $\{t_1\textbf{a}+t_2 \textbf{b}+t_3 \textbf{c}: 0 \le t_i \le 1\}$.
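This description translates directly into a membership test: $p$ lies in the parallelepiped exactly when solving $[\textbf{a}\ \textbf{b}\ \textbf{c}]\,t = p$ gives coefficients $t_i \in [0,1]$. Here is a small Python sketch (NumPy assumed; the defining vectors must be linearly independent for the solve to work):

```python
# Membership test for the parallelepiped spanned by a, b, c: a point p is
# inside iff p = t1*a + t2*b + t3*c with 0 <= t_i <= 1.
import numpy as np

def in_parallelepiped(p, a, b, c):
    t = np.linalg.solve(np.column_stack([a, b, c]), p)  # recover t1, t2, t3
    return bool(np.all((t >= 0) & (t <= 1)))

a = np.array([1., 0., 0.])
b = np.array([0., 1., 0.])
c = np.array([1., 1., 1.])
print(in_parallelepiped(0.5 * (a + b + c), a, b, c))  # True: the centre
print(in_parallelepiped(b + c, a, b, c))              # True: a vertex
print(in_parallelepiped(2.0 * a, a, b, c))            # False: outside
```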
Now for your second question: Paragraph 2.
The volume is given by $\text{base}\times \text{height}$; in other words, the area of the shaded region (the base) multiplied by $h$. We want to figure out what $h$ is.
First we "pick vector $\textbf{a}$". Then you have two defining vectors, $\textbf{b}$ and $\textbf{c}$, left over. The subspace spanned by $\textbf{b}$ and $\textbf{c}$ is $\{t_2\textbf{b}+t_3\textbf{c}:t_i\in \mathbb{R}\}$. But from the above definition of a parallelepiped, we see that the "base" of the parallelepiped is in fact $\{0\textbf{a} + t_2\textbf{b}+t_3\textbf{c}:0≤t_i≤1\} = \{t_2\textbf{b}+t_3\textbf{c}:0≤t_i≤1\} \subset \{t_2\textbf{b}+t_3\textbf{c}:t_i\in \mathbb{R}\}$. This means the height of the parallelepiped, i.e. the "distance from the top to the base" is equal to the "distance from the top to the subspace spanned by the other two defining vectors".
As for the "top", we see that the height of the parallelepiped is determined by the position of $\textbf{a}$. Hence the "distance from the top to the subspace spanned by the other two defining vectors" is in fact the "distance from $\textbf{a}$ to the subspace spanned by $\textbf{b}$ and $\textbf{c}$".
Note: The easiest way to think about "distance from a vector to a subspace" is to think of it as the length of the line perpendicular to the subspace which passes through the endpoint of the vector. Here's an example: the distance from $\textbf{b}$ to the yellow subspace is the length of $\textbf{b}-\textbf{p}$, because $\textbf{b}-\textbf{p}$ is perpendicular to the yellow subspace, and it passes through the endpoint of $\textbf{b}$.
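Putting the pieces together, here is a small numerical sketch (Python with NumPy, made-up vectors): the height is the distance from $\textbf{a}$ to $\operatorname{span}\{\textbf{b},\textbf{c}\}$, computed by projecting via least squares, and base area times height indeed recovers $|\det(\textbf{a},\textbf{b},\textbf{c})|$.

```python
# Height = distance from a to span{b, c}; then base area * height = |det|.
import numpy as np

a = np.array([1., 1., 2.])
b = np.array([1., 0., 0.])
c = np.array([0., 1., 0.])

B = np.column_stack([b, c])
coef, *_ = np.linalg.lstsq(B, a, rcond=None)  # least-squares projection
p = B @ coef                                  # projection of a onto span{b, c}
height = np.linalg.norm(a - p)                # a - p is perpendicular to base

base_area = np.linalg.norm(np.cross(b, c))    # area of the base parallelogram
print(base_area * height)                                # 2.0
print(abs(np.linalg.det(np.column_stack([a, b, c]))))    # 2.0 as well
```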
Citations: the first image is from Wikipedia; the second image is from ms.uky.edu.