Q1. Principal components are mutually orthogonal (uncorrelated) variables. Orthogonality and statistical independence are not synonyms, and there is nothing special about principal components in this respect; the same is true of any variables in multivariate data analysis. If the data are multivariate normal (which is not the same as saying that each variable is univariately normal) and the variables are uncorrelated, then yes, they are independent. Whether independence of principal components matters depends on how you are going to use them; quite often, their orthogonality will suffice.
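A quick numeric illustration (my own minimal numpy sketch, not part of the original answer): a symmetric variable and its square are exactly uncorrelated, yet one is a deterministic function of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x**2  # a deterministic function of x: as dependent as it gets

# Pearson correlation is ~0, because cov(x, x^2) = E[x^3] = 0 for symmetric x
print(np.corrcoef(x, y)[0, 1])
```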
Q2. Yes, scaling means shrinking or stretching the variance of individual variables. The variables are the dimensions of the space the data lie in. The PCA results - the components - are sensitive to the shape of the data cloud, the shape of that "ellipsoid". If you only center the variables and leave the variances as they are, this is often called "PCA based on covariances". If you also standardize the variables to variances = 1, this is often called "PCA based on correlations", and it can be very different from the former (see a thread). Also, relatively seldom do people do PCA on non-centered data (raw data, or data just scaled to unit magnitude); the results of such a PCA differ further from those where you center the data (see a picture).
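To see how much the covariance/correlation choice can matter, here is a small hypothetical sketch (the data and numbers are my own assumptions, purely for illustration): when one variable has a much larger scale, covariance-based PC1 is dominated by it, while correlation-based PC1 weights both variables roughly equally.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated variables on very different scales (hypothetical data)
X = rng.multivariate_normal([0, 0], [[100.0, 3.0], [3.0, 1.0]], size=1000)

Xc = X - X.mean(axis=0)      # centered only  -> "PCA based on covariances"
Xz = Xc / Xc.std(axis=0)     # standardized   -> "PCA based on correlations"

for name, data in [("covariances", Xc), ("correlations", Xz)]:
    vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    print(f"PCA on {name}: PC1 direction = {vecs[:, -1].round(3)}")
```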
Q3. The "constraint" is how PCA works (see a huge thread). Imagine your data is 3-dimensional cloud (3 variables, $n$ points); the origin is set at the centroid (the mean) of it. PCA draws component1 as such an axis through the origin, the sum of the squared projections (coordinates) on which is maximized; that is, the variance along component1 is maximized. After component1 is defined, it can be removed as a dimension, which means that the data points are projected onto the plane orthogonal to that component. You are left with a 2-dimensional cloud. Then again, you apply the above procedure of finding the axis of maximal variance - now in this remnant, 2D cloud. And that will be component2. You remove the drawn component2 from the plane by projecting data points onto the line orthogonal to it. That line, representing the remnant 1D cloud, is defined as the last component, component 3. You can see that on each of these 3 "steps", the analysis a) found the dimension of the greatest variance in the current $p$-dimensional space, b) reduced the data to the dimensions without that dimension, that is, to the $p-1$-dimensional space orthogonal to the mentioned dimension. That is how it turns out that each principal component is a "maximal variance" and all the components are mutually orthogonal (see also).
[P.S. Please note that "orthogonal" means two things: (1) variable axes as physically perpendicular axes; (2) variables as uncorrelated in their data. With PCA and some other multivariate methods, these two things are the same thing. But with some other analyses (e.g. discriminant analysis), uncorrelated extracted latent variables do not automatically have axes that are perpendicular in the original space.]
This sort of question has appeared several times on CV (browse through the pca and clustering questions). The short answer to your question is yes, it makes sense to inspect junior dimensions in search of structure (such as clusters) in your data. And why not? Often the senior components, explaining the lion's share of the variance, are irrelevant to the currently important distinctions in the data. I might cut a loaf of bread lengthwise; then the 1st PC of that ellipsoid won't show the two halves, but PC2 or PC3 is likely to show it - the bimodality.
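To make the loaf-of-bread example concrete, here is a hypothetical simulation (my own sketch, not from the original answer): the cloud is stretched along one axis, and the two "halves" are separated along a short axis, so PC1 shows no bimodality while PC2 does.

```python
import numpy as np

rng = np.random.default_rng(3)
# A "loaf": very elongated along x, with two halves separated along y
half1 = rng.normal([0, -2], [10.0, 0.5], size=(500, 2))
half2 = rng.normal([0,  2], [10.0, 0.5], size=(500, 2))
X = np.vstack([half1, half2])

X = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
scores = X @ vecs[:, ::-1]          # columns: PC1 scores, PC2 scores

# PC1 (the long axis) does not separate the halves; PC2 does
for j, name in enumerate(["PC1", "PC2"]):
    m1, m2 = scores[:500, j].mean(), scores[500:, j].mean()
    print(f"{name}: half means {m1:.2f} vs {m2:.2f}")
```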
One should remember that dimensionality reduction methods (such as PCA or PCoA) are not intended to find clusters or to map classes in the best way, and therefore they do not replace cluster analysis or discriminant analysis. With PCA or similar techniques, you can only hope that some dimensions will uncover the structure for you.
Just one example. Here are two scatterplots of the same 2-class data: one shows the first PC drawn on it, the other shows the discriminant function drawn. Neither PC1 nor the remaining PC2, orthogonal to it, is, taken alone, clearly bimodal. The discriminant is much better in that respect, because it was extracted for the purpose of capturing the difference between the two classes.
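The original scatterplots are not reproduced here, but the same effect can be sketched numerically (hypothetical data; scikit-learn's PCA and LinearDiscriminantAnalysis are used as stand-ins for the components drawn in those plots):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
# Two classes separated along a low-variance direction
cov = [[10.0, 0.0], [0.0, 0.5]]
X = np.vstack([rng.multivariate_normal([0, -1.5], cov, 300),
               rng.multivariate_normal([0,  1.5], cov, 300)])
y = np.repeat([0, 1], 300)

pc1 = PCA(n_components=2).fit_transform(X)[:, 0]
ld1 = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y).ravel()

# Standardized class separation along each axis:
for name, z in [("PC1", pc1), ("LD1", ld1)]:
    d = abs(z[y == 0].mean() - z[y == 1].mean()) / z.std()
    print(name, round(d, 2))   # LD1 shows a much larger gap than PC1
```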
An analytically logical path to uncover-then-plot structure would be to perform cluster analysis (or latent class analysis) to form classes, and then to use discriminant analysis (or, perhaps, multidimensional INDSCAL scaling) to plot those classes. However, discriminant analysis (DA) results are, naturally, dependent on the classes. PCA/PCoA results are not, since they are unsupervised and blind to nonhomogeneity in the data. But that is exactly the reason (or at least one of the reasons) why many people would prefer to attempt PCA instead of DA in order to visualize class distinctions.
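A bare-bones version of that cluster-then-discriminant pipeline might look like the sketch below (function name and parameters are mine, chosen for illustration; KMeans stands in for whatever clustering method is appropriate). Remember the caveat above: the resulting coordinates are tuned to the classes the clustering produced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def cluster_then_plot_coords(X, n_clusters=3):
    """Form classes by clustering, then return 2D discriminant scores
    that best separate those classes (ready for a scatterplot)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    coords = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, labels)
    return coords, labels

# Hypothetical usage: three blobs in 4 dimensions
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 1.0, size=(100, 4)) for m in (0, 4, 8)])
coords, labels = cluster_then_plot_coords(X)   # coords: (300, 2)
```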
You say, "To me this feels like you are fishing for the results that you want to see". This apprehension would be relevant in the context of multiple statistical significance testing, not in the present context of exploratory data analysis. Yes, EDA is "fishing" for revelations that might look good to you; that is what it is about. On the other hand, if you prefer to think of the junior dimensions of the data as noise dimensions (rather than weak but substantive ones), then indeed the "fishing" claim is appropriate. PCA itself does not separate signal from noise; one has to analyze statistically whether dimensions resemble noise or signal, but that implies assumptions about the data - so greet the vicious circle. Fortunately, with a sufficiently large sample size, noise dimensions are likely to blur real class differences, not to fake them.
Best Answer
Here's a cool excerpt from Jolliffe (1982) that I didn't include in my previous answer to the very similar question, "Low variance components in PCA, are they really just noise? Is there any way to test for it?" I find it pretty intuitive.
The three examples from the literature referred to in the last sentence of the second paragraph were the three I mentioned in my answer to the linked question.
Reference
Jolliffe, I. T. (1982). A note on the use of principal components in regression. Applied Statistics, 31(3), 300–303. Retrieved from http://automatica.dei.unipd.it/public/Schenato/PSC/2010_2011/gruppo4-Building_termo_identification/IdentificazioneTermodinamica20072008/Biblio/Articoli/PCR%20vecchio%2082.pdf