Well, I think it is really difficult to present a visual explanation of Canonical correlation analysis (CCA) vis-a-vis Principal components analysis (PCA) or Linear regression. The latter two are often explained and compared by means of 2D or 3D data scatterplots, but I doubt that is possible with CCA. Below I've drawn pictures which might explain the essence of, and the differences between, the three procedures, but even with these pictures - which are vector representations in the "subject space" - there are problems with capturing CCA adequately. (For the algebra/algorithm of canonical correlation analysis, see here.)
Drawing individuals as points in a space where the axes are variables - a usual scatterplot - gives a variable space. If you draw the opposite way - variables as points and individuals as axes - that gives a subject space. Drawing the many axes is actually needless, because the space has a number of non-redundant dimensions equal to the number of non-collinear variables. Variable points are connected with the origin and form vectors, arrows, spanning the subject space; so here we are (see also). In a subject space, if the variables have been centered, the cosine of the angle between their vectors is the Pearson correlation between them, and the vectors' squared lengths are their variances. On the pictures below the displayed variables are centered (so no constant term is needed).
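This geometry is easy to verify numerically. A small NumPy sketch (the toy data are mine): after centering, the cosine of the angle between the two variable vectors equals their Pearson correlation, and, scaled by $1/\sqrt{n-1}$, a vector's squared length equals the variable's variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)

# Center the variables: each becomes a vector in the n-dimensional subject space
v1 = x1 - x1.mean()
v2 = x2 - x2.mean()

# Cosine of the angle between the two centered vectors
cosine = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
r = np.corrcoef(x1, x2)[0, 1]          # ordinary Pearson correlation
print(np.isclose(cosine, r))           # True

# After scaling by 1/sqrt(n-1), a vector's squared length is the variance
u1 = v1 / np.sqrt(n - 1)
print(np.isclose(u1 @ u1, x1.var(ddof=1)))  # True
```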
Principal Components
Variables $X_1$ and $X_2$ positively correlate: they have an acute angle between them. Principal components $P_1$ and $P_2$ lie in the same space, "plane X", spanned by the two variables. The components are variables too, only mutually orthogonal (uncorrelated). The direction of $P_1$ is chosen to maximize the sum of the two squared loadings of this component; $P_2$, the remaining component, goes orthogonally to $P_1$ in plane X. The squared lengths of all four vectors are their variances (the variance of a component is the aforementioned sum of its squared loadings). Component loadings are the coordinates of variables onto the components - the $a$'s shown on the left pic. Each variable is an error-free linear combination of the two components, with the corresponding loadings being the regression coefficients. And vice versa, each component is an error-free linear combination of the two variables; the regression coefficients in this combination are given by the skew coordinates of the components onto the variables - the $b$'s shown on the right pic. The actual regression coefficient magnitude will be $b$ divided by the product of lengths (standard deviations) of the predicted component and the predictor variable, e.g. $b_{12}/(|P_1|*|X_2|)$. [Footnote: the component values appearing in the two linear combinations mentioned above are standardized values, st. dev. = 1. This is because the information about their variances is captured by the loadings. To speak in terms of unstandardized component values, the $a$'s on the pic above should be eigenvectors' values, the rest of the reasoning being the same.]
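The loading arithmetic above can be checked in a few lines (a NumPy sketch with toy data of my own): loadings are eigenvectors scaled by the square roots of the eigenvalues, a component's variance equals the sum of its squared loadings, and the variables are recovered error-free from the standardized component scores.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
X = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])

C = X.T @ X / (n - 1)                # covariance matrix of the two variables
eigval, eigvec = np.linalg.eigh(C)   # eigh returns ascending order
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]  # put P1 first

loadings = eigvec * np.sqrt(eigval)  # the a's: coordinates of variables on components

# Variance of each component equals the sum of its squared loadings
print(np.allclose((loadings ** 2).sum(axis=0), eigval))  # True

# Each variable is an error-free combination of the standardized components,
# with the loadings as regression coefficients
scores = X @ eigvec / np.sqrt(eigval)  # standardized component values (st. dev. = 1)
X_rebuilt = scores @ loadings.T
print(np.allclose(X_rebuilt, X))       # True
```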
Multiple Regression
Whereas in PCA everything lies in plane X, in multiple regression there appears a dependent variable $Y$ which usually doesn't belong to plane X, the space of the predictors $X_1$, $X_2$. But $Y$ is perpendicularly projected onto plane X, and the projection $Y'$, $Y$'s shadow, is the prediction by, or linear combination of, the two $X$'s. On the picture, the squared length of $e$ is the error variance. The cosine between $Y$ and $Y'$ is the multiple correlation coefficient. As with PCA, the regression coefficients are given by the skew coordinates of the prediction ($Y'$) onto the variables - the $b$'s. The actual regression coefficient magnitude will be $b$ divided by the length (standard deviation) of the predictor variable, e.g. $b_{2}/|X_2|$.
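The projection picture translates directly into code (a NumPy sketch; the toy data are mine): the least-squares prediction $Y'$ is the orthogonal projection of $Y$ onto plane X, the error vector is perpendicular to every predictor, and the cosine between $Y$ and $Y'$ is the multiple correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(size=n)

# Center everything so no intercept is needed
X = X - X.mean(axis=0)
y = y - y.mean()

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b                   # Y': the orthogonal projection of y onto plane X
e = y - y_hat                   # error vector, perpendicular to plane X

print(np.allclose(X.T @ e, 0))  # True: e is orthogonal to both predictors

# Cosine between y and its projection equals the multiple correlation R
cos_angle = y @ y_hat / (np.linalg.norm(y) * np.linalg.norm(y_hat))
R = np.corrcoef(y, y_hat)[0, 1]
print(np.isclose(cos_angle, R))  # True
```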
Canonical Correlation
In PCA, a set of variables predict themselves: they model principal components which in turn model back the variables; you don't leave the space of the predictors, and (if you use all the components) the prediction is error-free. In multiple regression, a set of variables predict one extraneous variable, and so there is some prediction error. In CCA, the situation is similar to that in regression, but (1) the extraneous variables are multiple, forming a set of their own; (2) the two sets predict each other simultaneously (hence correlation rather than regression); (3) what they predict in each other is an extract, a latent variable, rather than the observed predictand of a regression (see also).
Let's involve a second set of variables, $Y_1$ and $Y_2$, to correlate canonically with our $X$'s set. We have spaces - here, planes - X and Y. Note that for the situation to be nontrivial - as it was above with regression, where $Y$ stood outside plane X - planes X and Y must intersect at only one point, the origin. Unfortunately this is impossible to draw on paper, because a 4D presentation would be necessary. Anyway, the grey arrow indicates that the two origins are one point, the only one shared by the two planes. With that granted, the rest of the picture resembles what we had with regression. $V_x$ and $V_y$ are a pair of canonical variates. Each canonical variate is a linear combination of the respective variables, like $Y'$ was. $Y'$ was the orthogonal projection of $Y$ onto plane X. Here $V_x$ is a projection of $V_y$ onto plane X and simultaneously $V_y$ is a projection of $V_x$ onto plane Y, but they are not orthogonal projections. Instead, they are found (extracted) so as to minimize the angle $\phi$ between them. The cosine of that angle is the canonical correlation. Since the projections need not be orthogonal, the lengths (hence variances) of the canonical variates are not automatically determined by the fitting algorithm and are subject to conventions/constraints which may differ across implementations. The number of pairs of canonical variates (and hence the number of canonical correlations) is min(number of $X$s, number of $Y$s). And here comes the point where CCA resembles PCA: in PCA, you skim off mutually orthogonal principal components (as if) recursively until all the multivariate variability is exhausted; similarly, in CCA, mutually orthogonal pairs of maximally correlated variates are extracted until all the multivariate variability that can be predicted in the lesser space (lesser set) is exhausted.
In our example with $X_1$ $X_2$ vs $Y_1$ $Y_2$ there remains the second and weaker correlated canonical pair $V_{x(2)}$ (orthogonal to $V_x$) and $V_{y(2)}$ (orthogonal to $V_y$).
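To make the extraction concrete, here is a NumPy sketch (the toy data are mine) using one standard way to compute CCA: whiten each set and take the SVD of the cross-covariance, whose singular values are the canonical correlations. It checks that $\cos\phi$ between the first pair of variates equals the first canonical correlation, and that the two variates within a set are mutually orthogonal.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 2))                                # set X: X1, X2
Y = X @ rng.normal(size=(2, 2)) + rng.normal(size=(n, 2))  # set Y, correlated with X

X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1 / np.sqrt(w)) @ V.T

# SVD of the whitened cross-covariance: singular values = canonical correlations
Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y
U, s, Vt = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))

Wx = inv_sqrt(Sxx) @ U         # canonical weights for the X set
Wy = inv_sqrt(Syy) @ Vt.T      # canonical weights for the Y set
Vx, Vy = X @ Wx, Y @ Wy        # canonical variates, min(2, 2) = 2 pairs

# cos(phi) between the first pair of variates is the first canonical correlation
cos_phi = Vx[:, 0] @ Vy[:, 0] / (np.linalg.norm(Vx[:, 0]) * np.linalg.norm(Vy[:, 0]))
print(np.isclose(cos_phi, s[0]))            # True

# Variates within a set are mutually orthogonal, as in PCA
print(np.allclose(Vx[:, 0] @ Vx[:, 1], 0))  # True
```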
For the difference between CCA and PCA+regression see also Doing CCA vs. building a dependent variable with PCA and then doing regression.
What is the benefit of canonical correlation over individual Pearson correlations of pairs of variables from the two sets? (my answer's in comments).
The first step of ICA is to use PCA to project the dataset into a low-dimensional latent space. The second step is to perform a change of coordinates within the latent space, chosen to optimize a measure of non-Gaussianity. This tends to lead to coefficients and loadings that are, if not sparse, then at least concentrated within small numbers of observations and features, thereby facilitating interpretation.
Likewise, in this paper on CCA+ICA (Sui et al., "A CCA+ICA based model for multi-task brain imaging data fusion and its application to schizophrenia"), the first (see footnote) step is to perform CCA, which yields a projection of each dataset into a low-dimensional space. If the input datasets are $X_1$ and $X_2$, each with $N$ rows=observations, then CCA yields $Z_1 = X_1W_1$ and $Z_2 = X_2W_2$, where the $Z$'s also have $N$ rows=observations. Note that the $Z$'s have a small number of columns, paired between $Z_1$ and $Z_2$, as opposed to the $X$'s, which may not even have the same number of columns. The authors then apply the same coordinate-changing strategy as is used in ICA, but they apply it to the concatenated matrix $[Z_1 | Z_2]$.
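A shape-level sketch of the CCA stage (the sizes and data are mine; the ICA-style rotation itself is omitted): two datasets with different numbers of columns are reduced to paired $Z$ matrices with the same small number of columns, which are then concatenated.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p1, p2, k = 200, 6, 4, 3    # hypothetical sizes: X1 has 6 columns, X2 has 4

X1 = rng.normal(size=(n, p1))
X2 = X1[:, :p2] + 0.5 * rng.normal(size=(n, p2))  # correlated with X1
X1, X2 = X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1 / np.sqrt(w)) @ V.T

# CCA weights W1, W2 (matching the text's W_1, W_2), keeping the top k pairs
U, s, Vt = np.linalg.svd(inv_sqrt(X1.T @ X1) @ X1.T @ X2 @ inv_sqrt(X2.T @ X2))
W1 = (inv_sqrt(X1.T @ X1) @ U)[:, :k]
W2 = (inv_sqrt(X2.T @ X2) @ Vt.T)[:, :k]

Z1, Z2 = X1 @ W1, X2 @ W2      # N rows each, k paired columns
concat = np.hstack([Z1, Z2])   # [Z1 | Z2]: input to the ICA-style rotation

print(Z1.shape, Z2.shape, concat.shape)  # (200, 3) (200, 3) (200, 6)
```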
Footnote: the authors also use preprocessing steps involving PCA, which I ignore here. They are part of the paper's domain-specific analysis choices, rather than being essential to the CCA+ICA method.
Best Answer
This is a good question, but since it appears from it that you know PCA and CCA a good deal, you are able to answer it yourself. And you do:
Absolutely true. The correlation of Y's 1st PC with the X set will almost always be weaker than the correlation of Y's 1st CV with it. This becomes apparent from the pictures comparing the actions of PCA and CCA.
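This claim can be demonstrated numerically (a NumPy sketch with toy data of my own): PC1 of Y is just one particular linear combination of the Y variables, found without looking at X, while CV1 is the combination that maximizes the correlation with X, so PC1's multiple correlation with X can never exceed the first canonical correlation.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = rng.normal(size=(n, 3))
Y = X @ rng.normal(size=(3, 3)) + rng.normal(size=(n, 3))
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1 / np.sqrt(w)) @ V.T

def multiple_R(X, y):
    """Correlation between y and its least-squares prediction from X."""
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.corrcoef(y, y_hat)[0, 1]

# PC1 of Y: the best variance-summary of Y, found "blindly", without X
w_pc, V_pc = np.linalg.eigh(Y.T @ Y)
pc1 = Y @ V_pc[:, -1]          # eigh sorts ascending; last eigenvector = PC1

# First canonical correlation: largest singular value of the whitened
# cross-covariance = max multiple R over all linear combinations of Y
s = np.linalg.svd(inv_sqrt(X.T @ X) @ X.T @ Y @ inv_sqrt(Y.T @ Y),
                  compute_uv=False)

print(multiple_R(X, pc1) <= s[0] + 1e-12)  # True: PC1 can't beat CV1
```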
The PCA+regression you conceive of is a two-step, initially "unsupervised" ("blind", as you said) strategy, while CCA is a one-step, "supervised" strategy. Both are valid - each in its own investigatory setting!
The 1st principal component (PC1) obtained in PCA of set Y is a linear combination of the Y variables. The 1st canonical variate (CV1) extracted from set Y in CCA of sets Y and X is a linear combination of the Y variables, too. But they are different. (Explore the linked pics, and also note the point that CCA is closer to regression - actually a form of it - than to PCA.)
PC1 represents set Y. It is the linear summary and the "deputy" from set Y, to face the outer-world relationships later (such as in a subsequent regression of PC1 by variables X).
CV1 represents set X within set Y. It is the linear image of X belonging to Y, the "insider" in Y. The Y-X relationship is already there: CCA is a multivariate regression.
Suppose I've got a children's sample's results on a school anxiety questionnaire (such as the Phillips test) - the Y items - and their results on a social adaptation questionnaire - the X items. I want to establish the relationship between the two sets. Items correlate both inside X and inside Y, but they are quite different, and I'm not pleased with the idea of bluntly summing up item scores into a single score in either set, so I'm choosing to stay multivariate.
If I do PCA of Y, extract PC1, and then regress it on the X items, what does it mean? It means that I respect the anxiety questionnaire (Y items) as a sovereign (closed) domain of phenomena which can express itself - express itself by issuing its best weighted sum of items (accounting for the maximal variance), which represents the whole set Y: its general factor/pivot/trend, the "mainstream school anxiety complex", the PC1. Only after that representation is formed do I turn to the next question, how it might be related to social adaptation - the question I'll check in the regression.
If I do CCA of Y vs X, extracting the 1st pair of canonical variates - one from each set - having maximal correlation, what does it mean? It means that I suspect a common factor between (behind) both anxiety and adaptation which makes them correlate with each other. However, I have no reason or grounds to extract or model that factor by means of PCA or factor analysis of the combined set "X variables + Y variables" (because, for example, I see anxiety and adaptation as two quite different domains conceptually, or because the two questionnaires have very different scales (units) or differently shaped distributions that I fear to "merge", or the number of items is very different in them). I'll be content with just the canonical correlation between the sets. Or I might not be supposing any "common factor" behind the sets, and simply think "X affects Y". Since Y is multivariate, the effect is multidimensional, and I'm asking for the 1st-order, strongest effect. It is given by the 1st canonical correlation, and the prediction variable corresponding to it is the CV1 of set Y. CV1 is fished out of Y; Y is not an independent producer of it.