We will start from
$$\begin{align}
D^* &= \underset{D}{\operatorname{arg\,max}}\;\operatorname{Tr}\left(D^TX^TXD\right)\\
&= \underset{D}{\operatorname{arg\,max}}\left[\operatorname{Tr}\left(D_{l-1}^TX^TXD_{l-1}\right) + d^{(l)T}X^TXd^{(l)}\right]
\end{align}
$$
where $D_{l-1}$ denotes the matrix consisting of the first $l-1$ columns of $D$, and $d^{(l)}$ denotes its $l$-th column.
The two summands involve disjoint sets of columns of $D$ and hence can be maximized independently, subject to the orthonormality constraint.
Using the induction hypothesis, we conclude that $\operatorname{Tr}\left(D_{l-1}^TX^TXD_{l-1}\right)$ (with the constraint that the columns of $D_{l-1}$ are orthonormal) is maximized when $D_{l-1}$ consists of the orthonormal eigenvectors corresponding to the $l-1$ largest eigenvalues.
Notation:
Suppose $\lambda_1 \geqslant \dots \geqslant \lambda_n$ are the eigenvalues of $X^TX$ and $v_1, \dots, v_n$ are the corresponding orthonormal eigenvectors.
Denote $H_{l-1} = \operatorname{span}\{v_1, \dots, v_{l-1}\}$ and let $H_{l-1}^{\bot}$ be its orthogonal complement, i.e. $H_{l-1}^{\bot} = \operatorname{span}\{v_l, \dots, v_n\}$.
Lemma:
$$\begin{align}\lambda_l &= \underset{d^{(l)}}{\max}\ d^{(l)T}X^TXd^{(l)} \quad \text{s.t. } \Vert d^{(l)}\Vert = 1,\ d^{(l)} \in H_{l-1}^\bot \\
&= v_l^TX^TXv_l \end{align}$$
Proof:
Let $\Sigma = X^TX$. Since $\Sigma$ is a symmetric positive semidefinite matrix, it admits an eigendecomposition $\Sigma = V\Lambda V^T$, where the columns of $V$ are $v_1,\dots,v_n$ in that order and hence $\Lambda=\operatorname{diag}(\lambda_1,\dots,\lambda_n)$.
$$
\begin{align}
d^{(l)T}\Sigma d^{(l)} &= d^{(l)T} V\Lambda V^T d^{(l)}\\
&= q^T \Lambda\, q \qquad [\text{where } q = V^Td^{(l)}]\\
&= \sum_{i=1}^n q_i^2 \lambda_i \qquad [\text{where } q_i = (V^Td^{(l)})_i = v_i^T d^{(l)}]\\
&= \sum_{i=l}^n q_i^2 \lambda_i \qquad [\because d^{(l)} \in H_{l-1}^\bot \implies q_i = v_i^T d^{(l)} = 0\ \forall i < l]\\
\end{align}
$$
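As a quick numerical sanity check of this identity, here is a short NumPy sketch; the data matrix `X`, its dimensions, and the seed are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))      # hypothetical data matrix
Sigma = X.T @ X                   # symmetric PSD, so eigh applies

lam, V = np.linalg.eigh(Sigma)    # eigh returns eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]    # reorder so lambda_1 >= ... >= lambda_n

d = rng.normal(size=6)
d /= np.linalg.norm(d)            # random unit vector

q = V.T @ d                       # coordinates of d in the eigenbasis
lhs = d @ Sigma @ d
rhs = np.sum(q**2 * lam)          # sum_i q_i^2 lambda_i
assert np.isclose(lhs, rhs)
print(lhs, rhs)                   # the two agree up to floating point
```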
Reminder:
$d^{(l)} \in H_{l-1}^\bot$ with $\Vert d^{(l)}\Vert = 1$ means
$$d^{(l)} = \sum_{k=l}^n \alpha_k v_k, \qquad \sum_{k=l}^n \alpha_k^2 = 1.$$
Now
$$\begin{align}
\sum_{i=l}^n q_i^2
&= \sum_{i=l}^n \left(v_i^T \sum_{k=l}^n \alpha_k v_k\right)^2 \\
&= \sum_{i=l}^n (\alpha_i v_i^T v_i)^2 \qquad [\because v_1,\dots,v_n\ \text{are orthonormal}] \\
&= \sum_{i=l}^n \alpha_i^2 = 1
\end{align}
$$
Therefore $d^{(l)T} \Sigma d^{(l)}$ is a convex combination of $\lambda_l,\dots,\lambda_n$, whose maximum value $\lambda_l$ is attained at $d^{(l)} = v_l$: $$\underset{d^{(l)}}{\max}\ d^{(l)T}\Sigma d^{(l)} = \underset{d^{(l)}}{\max}\ d^{(l)T}X^TXd^{(l)} = v_l^TX^TXv_l = \lambda_l \quad \blacksquare$$
We conclude that $D^*$ is obtained by augmenting $D_{l-1}$ with the column $v_l$, which completes the induction step and hence the original proof.
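If you want to verify the conclusion numerically, the following sketch compares the eigenvector choice of $D$ against random orthonormal matrices; all sizes, the seed, and variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 8, 3
X = rng.normal(size=(100, n))     # hypothetical data matrix
Sigma = X.T @ X

lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]    # descending eigenvalues

D_star = V[:, :l]                 # top-l eigenvectors as columns
best = np.trace(D_star.T @ Sigma @ D_star)
assert np.isclose(best, lam[:l].sum())   # trace equals sum of top-l eigenvalues

for _ in range(1000):
    Q, _ = np.linalg.qr(rng.normal(size=(n, l)))  # random orthonormal columns
    assert np.trace(Q.T @ Sigma @ Q) <= best + 1e-9
```

No random orthonormal $D$ ever exceeds the eigenvector choice, consistent with the maximum being $\sum_{i=1}^{l}\lambda_i$.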
Here $S=\sum_{i=1}^n a_i X_i$ with weights satisfying $\sum_{i=1}^n a_i=1$ and $\operatorname{Var}(X_i)=\sigma^2$ for all $i$. Since the $X_i$'s are independent, $\operatorname{Cov}(X_i,X_j)=0$ if $i\neq j$. It follows that $$\operatorname{Var}(S)=\operatorname{Var}\left(\sum_{i=1}^na_iX_i\right)=\sigma^2\sum_{i=1}^na_i^2.$$ We can bypass the lengthy Lagrangian method by using the Cauchy–Schwarz inequality: $$1=\left(\sum_{i=1}^na_i\right)^2=\left(\sum_{i=1}^n\frac{a_i\sigma}{\sigma}\right)^2\leq\left(\sum_{i=1}^na_i^2\sigma^2\right)\left(\sum_{i=1}^n\frac{1}{\sigma^2}\right)=\frac{n\operatorname{Var}(S)}{\sigma^2}.$$ Now if $a_i=1/n$, then $$\operatorname{Var}(S)=\sigma^2\sum_{i=1}^n\frac{1}{n^2}=\frac{\sigma^2}{n},$$ which is exactly the lower bound given by Cauchy–Schwarz.
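A quick numerical sketch of this bound, assuming hypothetical values for $n$ and $\sigma^2$: random weight vectors summing to one never beat the uniform weights.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 10, 4.0               # hypothetical sample size and variance
bound = sigma2 / n                # the Cauchy-Schwarz lower bound sigma^2 / n

for _ in range(1000):
    a = rng.normal(size=n)
    a /= a.sum()                  # enforce sum(a_i) = 1
    var_S = sigma2 * np.sum(a**2) # Var(S) for independent X_i
    assert var_S >= bound - 1e-12

uniform = sigma2 * np.sum((np.ones(n) / n)**2)
print(uniform, bound)             # uniform weights attain the bound exactly
```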
Best Answer
There is a good, long explanation of PCA at Making sense of principal component analysis, eigenvectors & eigenvalues, so here I will only show why the eigenvector with the largest eigenvalue maximizes the variance of linear combinations.
Let $\Sigma$ be the variance-covariance matrix of the random variables $X_1, X_2, \dotsc, X_n$, which we will let be the components of the random vector $X$. That is, $$ \DeclareMathOperator{\V}{\mathbb{V}} \DeclareMathOperator{\C}{\mathbb{C}} \C(X_i,X_j)=\Sigma_{ij}. $$ Since $\Sigma$ is a symmetric matrix, we can use the spectral decomposition $$ \Sigma = U \Lambda U^T = \sum_{i=1}^n \lambda_i^2 u_i u_i^T $$ where $U$ is orthogonal with columns $u_1, \dotsc, u_n$ and $\Lambda = \operatorname{diag}(\lambda_1^2, \dotsc, \lambda_n^2)$ (the eigenvalues are nonnegative because $\Sigma$ is positive semidefinite, so we may write them as squares). Now let $a$ be an $n$-vector with $\Vert a \Vert = 1$ defining the linear combination $a^T X$. We want to choose $a$ so as to maximize the variance of the linear combination: $$ \V\left( a^T X\right) = a^T \V\left( X \right) a = a^T \Sigma a = a^T \left(\sum_{i=1}^n \lambda_i^2 u_i u_i^T\right) a = \sum_{i=1}^n \lambda_i^2 (a^T u_i) (u_i^T a). $$ Since the vectors $u_i$ are mutually orthonormal and $\Vert a \Vert = 1$, the weights $(a^T u_i)(u_i^T a) = (a^T u_i)^2$ are nonnegative and sum to one, so the sum above is a weighted average of the $n$ values $\lambda_i^2$, and it is clear that it is maximized when the largest value $\lambda_1^2$ gets all the weight. That will happen if we choose $a = u_1$, the eigenvector corresponding to the largest eigenvalue $\lambda_1^2$. But then the weight of the rest will be zero, by orthonormality. That finishes the proof.
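For completeness, here is a small NumPy check of this claim; the covariance matrix below is a hypothetical example, built as $AA^T$ so that it is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T                   # symmetric positive semidefinite

lam, U = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
top_val, top_vec = lam[-1], U[:, -1]

# the top eigenvector attains the variance equal to the largest eigenvalue
assert np.isclose(top_vec @ Sigma @ top_vec, top_val)

for _ in range(1000):
    a = rng.normal(size=5)
    a /= np.linalg.norm(a)        # unit-norm constraint on the combination
    assert a @ Sigma @ a <= top_val + 1e-9
```

Every random unit vector yields variance at most the largest eigenvalue, matching the weighted-average argument above.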