Indeed, there is no guarantee that top principal components (PCs) have more predictive power than the low-variance ones.
Real-world examples can be found where this is not the case, and it is easy to construct an artificial example where e.g. only the smallest PC has any relation to $y$ at all.
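To make the artificial case concrete, here is a minimal numpy sketch (the data are simulated purely for illustration) in which $y$ is driven entirely by the lowest-variance direction, so only the last PC correlates with it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# scores along three orthogonal "true" directions, with very different variances
z = rng.normal(size=(n, 3)) * np.array([10.0, 3.0, 0.1])
# rotate into an arbitrary basis so the structure is not obvious in the raw columns
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = z @ Q.T
y = z[:, 2] + 0.01 * rng.normal(size=n)      # y depends only on the smallest direction

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                           # PC scores, ordered by decreasing variance
for j in range(3):
    r = np.corrcoef(scores[:, j], y)[0, 1]
    print(f"PC{j + 1}: variance {s[j] ** 2 / (n - 1):8.3f}, corr with y {r:+.3f}")
```

Here the first two PCs carry essentially all of the variance of $X$ but are (nearly) uncorrelated with $y$, while the tiny third PC is almost perfectly correlated with it.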
This topic has been discussed a lot on our forum, and in the (unfortunate) absence of one clearly canonical thread, I can only give several links that together provide various real-life as well as artificial examples:
And the same topic, but in the context of classification:
However, in practice, top PCs often do have more predictive power than the low-variance ones; moreover, using only the top PCs can yield better predictive power than using all PCs.
In situations with many predictors $p$ and relatively few data points $n$ (e.g. when $p \approx n$ or even $p>n$), ordinary regression will overfit and needs to be regularized. Principal component regression (PCR) can be seen as one way to regularize the regression and will tend to give results superior to the unregularized fit. Moreover, it is closely related to ridge regression, which is a standard form of shrinkage regularization. While ridge regression is usually a better idea, PCR will often behave reasonably well. See Why does shrinkage work? for a general discussion of the bias-variance tradeoff and of how shrinkage can be beneficial.
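As a rough illustration (scikit-learn, with data simulated just for this purpose rather than any particular real dataset, and with the number of components and the ridge penalty chosen arbitrarily), one can compare plain least squares, ridge, and PCR by cross-validation when $p$ is close to $n$:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# p close to n: plain least squares is prone to overfitting here
X, y = make_regression(n_samples=60, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

models = {
    "OLS":              LinearRegression(),
    "Ridge (alpha=10)": Ridge(alpha=10.0),
    "PCR (5 PCs)":      make_pipeline(PCA(n_components=5), LinearRegression()),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:18s} mean CV R^2 = {scores.mean():.2f}")
```

The exact numbers depend on the simulation, but the pattern is the point: both regularized fits (ridge and PCR) tend to generalize better than the unregularized one in this regime.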
In a way, one can say that both ridge regression and PCR assume that most information about $y$ is contained in the large PCs of $X$, and this assumption is often warranted.
See the later answer by @cbeleites (+1) for some discussion about why this assumption is often warranted (and also this newer thread: Is dimensionality reduction almost always useful for classification? for some further comments).
Hastie et al. in The Elements of Statistical Learning (section 3.4.1) comment on this in the context of ridge regression:
[T]he small singular values [...] correspond to directions in the column space of $\mathbf X$ having small variance, and ridge regression shrinks these directions the most. [...] Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.
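This shrinkage pattern is easy to check numerically. Writing the ridge fit through the SVD of the centered $\mathbf X$ (the same decomposition used in that section of ESL), each principal direction $j$ is shrunk by the factor $d_j^2/(d_j^2+\lambda)$, so the short (small-variance) directions are shrunk the most. A small numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 200, 5, 10.0
X = rng.normal(size=(n, p)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])  # columns of very different scale
X -= X.mean(axis=0)
y = rng.normal(size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                # shrinkage factor per principal direction
fit_svd = U @ (shrink * (U.T @ y))          # ridge fitted values, SVD form

beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # ridge coefficients directly
print(np.allclose(fit_svd, X @ beta))       # True: the two forms agree
print(np.round(shrink, 3))                  # near 1 for long directions, near 0 for short ones
```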
See my answers in the following threads for details:
Bottom line
For high-dimensional problems, pre-processing with PCA (meaning reducing dimensionality and keeping only top PCs) can be seen as one way of regularization and will often improve the results of any subsequent analysis, be it a regression or a classification method. But there is no guarantee that this will work, and there are often better regularization approaches.
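For instance (a scikit-learn sketch on simulated data; the particular classifier, numbers of samples, features, and components are arbitrary choices for illustration), one can compare a classifier with and without a PCA step by cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# many features, few samples: a setting where PCA pre-processing may help
X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           random_state=0)

plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pca10 = make_pipeline(StandardScaler(), PCA(n_components=10),
                      LogisticRegression(max_iter=1000))
for name, clf in [("no PCA", plain), ("PCA(10) first", pca10)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:14s} mean CV accuracy = {acc:.2f}")
```

Whether the PCA step helps depends on the data; the point is only that it is one regularization knob among several, to be compared rather than assumed superior.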
I will try to explain how the orthogonality of $a_1$ and $a_2$ ensures that $y_1$ and $y_2$ are uncorrelated. We want $a_1$ to maximize $Var(y_1)=a_1^T \Sigma a_1$. This maximum cannot be attained unless we constrain $a_1$, in this case by requiring $a_1^T a_1=1$. This optimization calls for the use of a Lagrange multiplier (it's not too complicated; read about it on Wikipedia). We thus try to maximize
\begin{equation}
a_1^T \Sigma a_1 - \lambda(a_1^T a_1-1)
\end{equation}
with respect to both $a_1$ and $\lambda$. Notice that differentiation with respect to $\lambda$ and then equating to $0$ gives our constraint $a_1^T a_1=1$. Differentiation with respect to $a_1$ gives
\begin{equation}
\Sigma a_1 -\lambda a_1 =0
\end{equation}
or
\begin{equation}
(\Sigma -\lambda I_p)a_1=0
\end{equation}
so $a_1$ must be an eigenvector of $\Sigma$ with eigenvalue $\lambda$. Since $Var(y_1)=a_1^T \Sigma a_1=\lambda a_1^T a_1=\lambda$, the variance of $y_1$ is maximized by taking $\lambda$ to be the greatest eigenvalue $\lambda_1$, so that $\Sigma a_1=\lambda_1 a_1$. Here comes the part that answers your question. Some elementary calculations using the definition of covariance (and the symmetry of $\Sigma$) show that
\begin{equation}
Cov(y_1,y_2)=Cov(a^T_1 x,a^T_2 x)=a^T_1\Sigma a_2=a^T_2\Sigma a_1=a^T_2\lambda_1 a_1=\lambda_1 a^T_2 a_1
\end{equation}
which equals $0$ if and only if $a^T_2 a_1=0$ (assuming $\lambda_1 \neq 0$).
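A quick numerical check of this conclusion (numpy, with arbitrary simulated data): the eigenvectors of $\Sigma$ are orthogonal, and the resulting scores are indeed uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # some correlated variables
Sigma = np.cov(X, rowvar=False)

eigval, A = np.linalg.eigh(Sigma)          # columns of A are orthonormal eigenvectors a_j
scores = (X - X.mean(axis=0)) @ A          # y_j = a_j^T x for every observation

print(np.round(A.T @ A, 10))                        # identity: the a_j are orthogonal
print(np.round(np.cov(scores, rowvar=False), 10))   # diagonal: Cov(y_i, y_j) = 0 for i != j
                                                    # (the diagonal entries match eigval)
```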
Consider what PCA does. Put simply, PCA (as most typically run) creates a new coordinate system by:

1. shifting the origin to the centroid of your data,
2. stretching and/or squishing the axes so they are of equal length, and
3. rotating your axes into a new orientation.
(For more details, see this excellent CV thread: Making sense of principal component analysis, eigenvectors & eigenvalues.) However, it doesn't just rotate your axes any old way. Your new $X_1$ (the first principal component) is oriented in your data's direction of maximal variation. The second principal component is oriented in the direction of the next greatest amount of variation that is orthogonal to the first principal component. The remaining principal components are formed likewise.
With this in mind, let's examine @amoeba's example. Here is a data matrix with two points in a three dimensional space:
$$ X = \bigg[ \begin{array}{ccc} 1 &1 &1 \\ 2 &2 &2 \end{array} \bigg] $$ Let's view these points in a (pseudo) three dimensional scatterplot:
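Here is a minimal matplotlib sketch of that plot (the dashed line through the two points is added just to show the single direction they span):

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter([1, 2], [1, 2], [1, 2], s=60)        # the two rows of X
ax.plot([0, 3], [0, 3], [0, 3], ls="--")        # the line they span (the first PC direction)
ax.set_xlabel("$x_1$"); ax.set_ylabel("$x_2$"); ax.set_zlabel("$x_3$")
plt.show()
```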
So let's follow the steps listed above. (1) The origin of the new coordinate system will be located at $(1.5, 1.5, 1.5)$. (2) The axes are already equal. (3) The first principal component will go diagonally from $(0,0,0)$ to $(3,3,3)$, which is the direction of greatest variation for these data. Now, the second principal component must be orthogonal to the first, and should go in the direction of the greatest remaining variation. But what direction is that? Is it from $(0,0,3)$ to $(3,3,0)$, or from $(0,3,0)$ to $(3,0,3)$, or something else? There is no remaining variation, so there cannot be any more principal components.
With $N=2$ data points, we can fit at most $N-1 = 1$ principal component.
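This is easy to verify with a quick SVD (numpy): after centering, the two-point matrix has exactly one nonzero singular value, i.e. a single principal component carrying all of the variance.

```python
import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0]])
Xc = X - X.mean(axis=0)                           # center the two points
singular_values = np.linalg.svd(Xc, compute_uv=False)
print(singular_values)                            # approx [1.225, 0.]: only one nonzero value
print(singular_values**2 / (len(X) - 1))          # variances along the PCs: [1.5, 0.]
```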