Unlike factor analysis, you cannot just put eight variables into a "regression test" and treat them all equally. One variable has to be the response variable and the others explanatory variables.

Your eight factors have been specifically designed to be orthogonal to eachother. I suspect you have put seven of the factors into a regression as explanatory variables (sometimes called "independent variables") with the eighth as the response variable (sometimes called "dependent variable"). Certainly you would find a low $R^2$ value, and estimates of the coefficient parameters close to zero, in this case.

Using factors as explanatory variables in a regression is sometimes justifiable (although I have some qualms myself - see @whuber's answer here for one reason why it might be questionable). However, the response variable for the regression needs to have been kept out of the original factor analysis. So you can only use your eight factors in a regression if the intent is to explain a ninth variable, one that was not in the original factor analysis.

Factor analysis is essentially a (constrained) linear regression model. In this model, each analyzed variable is the dependent variable, common factors are the IVs, and the implied unique factor serve as the error term. (The constant term is set to zero due to centering or standardizing which are implied in computation of covariances or correlations.) So, exactly like in linear regression, there could exist "strong" assumption of normality - IVs (common factors) are multivariate normal and errors (unique factor) are normal, which automatically leads to that the DV is normal; and "weak" assumption of normality - errors (unique factor) are normal only, therefore the DV needs not to be normal. Both in regression and FA we usually admit "weak" assumption because it is more realistic.

Among classic FA extraction methods only the maximum likelihood method, because it departs from the characteristics of population, states that the analyzed variables be multivariate normal. Methods like principal axes or minimal residuals do not require this "strong" assumption (albeit you can make it anyway).

Please remember that even if your variables are normal separately, it doesn't necessarily guarantee that your data are multivariate normal.

Let us accept "weak" assumption of normality. What is the potential threat coming from strongly skewed data, like your, then? It is outliers. If the distribution of a variable is strongly asymmetric the longer tail becomes extra influential in computing correlations or covariances, and simultaneously it provokes apprehension about whether it still measures the same psychological construct (the factor) as the shorter tail does. It might be cautious to compare whether correlation matrices built on the lower half and the upper half of the rating scale are similar or not. If they are similar enough, you may conclude that both tails measure the same thing and do not transform your variables. Otherwise you should consider transforming or some other action to neutralize the effect of "outlier" long tail.

Transformations are plenty. For example, raising to a power>1 or exponentiation are used for left-skewed data, and power<1 or logarithm - for right-skewed. My own experience says that so called *optimal transformation* via **Categorical PCA** performed prior FA is almost always beneficial, for it usually leads to more clear, interpretable factors in FA; under the assumption that the number of factors is known, it transforms your data nonlinearly so as to maximize the overall variance accounted by that number of factors.

## Best Answer

Disclaimer: @ttnphns is very knowledgeable about both PCA and FA, and I respect his opinion and have learned a lot from many of his great answers on the topic. However, I tend to disagree with his reply here, as well as with other (numerous) posts on this topic here on CV, not only his; or rather, I think they have limited applicability.I think that the difference between PCA and FA is overrated.

Look at it like that: both methods attempt to provide a low-rank approximation of a given covariance (or correlation) matrix. "Low-rank" means that only a limited (low) number of latent factors or principal components is used. If the $n \times n$ covariance matrix of the data is $\mathbf C$, then the models are:

\begin{align} \mathrm{PCA:} &\:\:\: \mathbf C \approx \mathbf W \mathbf W^\top \\ \mathrm{PPCA:} &\:\:\: \mathbf C \approx \mathbf W \mathbf W^\top + \sigma^2 \mathbf I \\ \mathrm{FA:} &\:\:\: \mathbf C \approx \mathbf W \mathbf W^\top + \boldsymbol \Psi \end{align}

Here $\mathbf W$ is a matrix with $k$ columns (where $k$ is usually chosen to be a small number, $k<n$), representing $k$ principal components or factors, $\mathbf I$ is an identity matrix, and $\boldsymbol \Psi$ is a diagonal matrix. Each method can be formulated as finding $\mathbf W$ (and the rest) minimizing the [norm of the] difference between left-hand and right-hand sides.

PPCA stands for

probabilistic PCA, and if you don't know what that is, it does not matter so much for now. I wanted to mention it, because it neatly fits between PCA and FA, having intermediate model complexity. It also puts the allegedly large difference between PCA and FA into perspective: even though it is a probabilistic model (exactly like FA), it actually turns out to be almost equivalent to PCA ($\mathbf W$ spans the same subspace).Most importantly, note that the models only differ in how they treat the diagonal of $\mathbf C$. As the dimensionality $n$ increases, the diagonal becomes in a way less and less important (because there are only $n$ elements on the diagonal and $n(n-1)/2 = \mathcal O (n^2)$ elements off the diagonal). As a result, for the large $n$ there is usually not much of a difference between PCA and FA at all, an observation that is rarely appreciated. For small $n$ they can indeed differ a lot.

Now to answer your main question as to why people in some disciplines seem to prefer PCA. I guess it boils down to the fact that it is mathematically a lot easier than FA (this is not obvious from the above formulas, so you have to believe me here):

PCA -- as well as PPCA, which is only slightly different, -- has an analytic solution, whereas FA does not. So FA needs to be numerically fit, there exist various algorithms of doing it, giving possibly different answers and operating under different assumptions, etc. etc. In some cases some algorithms can get stuck (see e.g. "heywood cases"). For PCA you perform an eigen-decomposition and you are done; FA is a lot more messy.

Technically, PCA simply rotates the variables, and that is why one can refer to it as a mere transformation, as @NickCox did in his comment above.

PCA solution does not depend on $k$: you can find first three PCs ($k=3$) and the first two of those are going to be identical to the ones you would find if you initially set $k=2$. That is not true for FA: solution for $k=2$ is not necessarily contained inside the solution for $k=3$. This is counter-intuitive and confusing.

Of course FA is more flexible model than PCA (after all, it has more parameters) and can often be more useful. I am not arguing against that. What I

amarguing against, is the claim that they are conceptually very different with PCA being about "describing the data" and FA being about "finding latent variables". I just do not see this is as true [almost] at all.To comment on some specific points mentioned above and in the linked answers:

"in PCA the number of dimensions to extract/retain is fundamentally subjective, while in EFA the number is fixed, and you usually have to check several solutions"-- well, the choice of the solution is still subjective, so I don't see any conceptual difference here. In both cases, $k$ is (subjectively or objectively) chosen to optimize the trade-off between model fit and model complexity."FA is able to explain pairwise correlations (covariances). PCA generally cannot do it"-- not really, both of them explain correlations better and better as $k$ grows.Sometimes extra confusion arises (but not in @ttnphns's answers!) due to the different

practicesin the disciplines using PCA and FA. For example, it is a common practice to rotate factors in FA to improve interpretability. This is rarely done after PCA, but in principle nothing is preventing it. So people often tend to think that FA gives you something "interpretable" and PCA does not, but this is often an illusion.Finally, let me stress again that for very small $n$ the differences between PCA and FA can indeed be large, and maybe some of the claims in favour of FA are done with small $n$ in mind. As an extreme example, for $n=2$ a single factor can always perfectly explain the correlation, but one PC can fail to do it quite badly.

## Update 1: generative models of the data

You can see from the number of comments that what I am saying is taken to be controversial. At the risk of flooding the comment section even further, here are some remarks regarding "models" (see comments by @ttnphns and @gung). @ttnphns does not like that I used the word "model"

[of the covariance matrix]to refer to the approximations above; it is an issue of terminology, but what he calls "models" are probabilistic/generative modelsof the data:\begin{align} \mathrm{PPCA}: &\:\:\: \mathbf x = \mathbf W \mathbf z + \boldsymbol \mu + \boldsymbol \epsilon, \; \boldsymbol \epsilon \sim \mathcal N(0, \sigma^2 \mathbf I) \\ \mathrm{FA}: &\:\:\: \mathbf x = \mathbf W \mathbf z + \boldsymbol \mu + \boldsymbol \epsilon, \; \boldsymbol \epsilon \sim \mathcal N(0, \boldsymbol \Psi) \end{align}

Note that PCA is not a probabilistic model, and cannot be formulated in this way.

The difference between PPCA and FA is in the noise term: PPCA assumes the same noise variance $\sigma^2$ for each variable, whereas FA assumes different variances $\Psi_{ii}$ ("uniquenesses"). This minor difference has important consequences. Both models can be fit with a general expectation-maximization algorithm. For FA no analytic solution is known, but for PPCA one can analytically derive the solution that EM will converge to (both $\sigma^2$ and $\mathbf W$). Turns out, $\mathbf W_\mathrm{PPCA}$ has columns in the same direction but with a smaller length than standard PCA loadings $\mathbf W_\mathrm{PCA}$ (I omit exact formulas). For that reason I think of PPCA as "almost" PCA: $\mathbf W$ in both cases span the same "principal subspace".

The proof (Tipping and Bishop 1999) is a bit technical; the intuitive reason for why homogeneous noise variance leads to a much simpler solution is that $\mathbf C - \sigma^2 \mathbf I$ has the same eigenvectors as $\mathbf C$ for any value of $\sigma^2$, but this is not true for $\mathbf C - \boldsymbol \Psi$.

So yes, @gung and @ttnphns are right in that FA is based on a generative model and PCA is not, but I think it is important to add that PPCA is also based on a generative model, but is "almost" equivalent to PCA. Then it ceases to seem such an important difference.

## Update 2: how come PCA provides best approximation to the covariance matrix, when it is well-known to be looking for maximal variance?

PCA has two equivalent formulations: e.g. first PC is (a) the one maximizing the variance of the projection and (b) the one providing minimal reconstruction error. More abstractly, the equivalence between maximizing variance and minimizing reconstruction error can be seen using Eckart-Young theorem.

If $\mathbf X$ is the data matrix (with observations as rows, variables as columns, and columns are assumed to be centered) and its SVD decomposition is $\mathbf X=\mathbf U\mathbf S\mathbf V^\top$, then it is well known that columns of $\mathbf V$ are eigenvectors of the scatter matrix (or covariance matrix, if divided by the number of observations) $\mathbf C=\mathbf X^\top \mathbf X=\mathbf V\mathbf S^2\mathbf V^\top$ and so they are axes maximizing the variance (i.e. principal axes). But by the Eckart-Young theorem, first $k$ PCs provide the best rank-$k$ approximation to $\mathbf X$: $\mathbf X_k=\mathbf U_k\mathbf S_k \mathbf V^\top_k$ (this notation means taking only $k$ largest singular values/vectors) minimizes $\|\mathbf X-\mathbf X_k\|^2$.

The first $k$ PCs provide not only the best rank-$k$ approximation to $\mathbf X$, but also to the covariance matrix $\mathbf C$. Indeed, $\mathbf C=\mathbf X^\top \mathbf X=\mathbf V\mathbf S^2\mathbf V^\top$, and the last equation provides the SVD decomposition of $\mathbf C$ (because $\mathbf V$ is orthogonal and $\mathbf S^2$ is diagonal). So the Eckert-Young theorem tells us that the best rank-$k$ approximation to $\mathbf C$ is given by $\mathbf C_k = \mathbf V_k\mathbf S_k^2\mathbf V_k^\top$. This can be transformed by noticing that $\mathbf W = \mathbf V\mathbf S$ are PCA loadings, and so $$\mathbf C_k=\mathbf V_k\mathbf S_k^2\mathbf V^\top_k=(\mathbf V\mathbf S)_k(\mathbf V\mathbf S)_k^\top=\mathbf W_k\mathbf W^\top_k.$$

The bottom-line here is that $$ \mathrm{minimizing} \; \left\{\begin{array}{ll} \|\mathbf C-\mathbf W\mathbf W^\top\|^2 \\ \|\mathbf C-\mathbf W\mathbf W^\top-\sigma^2\mathbf I\|^2 \\ \|\mathbf C-\mathbf W\mathbf W^\top-\boldsymbol\Psi\|^2\end{array}\right\} \; \mathrm{leads \: to} \; \left\{\begin{array}{cc} \mathrm{PCA}\\ \mathrm{PPCA} \\ \mathrm{FA} \end{array}\right\} \; \mathrm{loadings},$$ as stated in the beginning.

## Update 3: numerical demonstration that PCA$\to$FA when $n \to \infty$

I was encouraged by @ttnphns to provide a numerical demonstration of my claim that as dimensionality grows, PCA solution approaches FA solution. Here it goes.

I generated a $200\times 200$ random correlation matrix with some strong off-diagonal correlations. I then took the upper-left $n \times n$ square block $\mathbf C$ of this matrix with $n=25, 50, \dots 200$ variables to investigate the effect of the dimensionality. For each $n$, I performed PCA and FA with number of components/factors $k=1\dots 5$, and for each $k$ I computed the off-diagonal reconstruction error $$\sum_{i\ne j}\left[\mathbf C - \mathbf W \mathbf W^\top\right]^2_{ij}$$ (note that on the diagonal, FA reconstructs $\mathbf C$ perfectly, due to the $\boldsymbol \Psi$ term, whereas PCA does not; but the diagonal is ignored here). Then for each $n$ and $k$, I computed the ratio of the PCA off-diagonal error to the FA off-diagonal error. This ratio has to be above $1$, because FA provides the best possible reconstruction.

On the right, different lines correspond to different values of $k$, and $n$ is shown on the horizontal axis. Note that as $n$ grows, ratios (for all $k$) approach $1$, meaning that PCA and FA yield approximately the same loadings, PCA$\approx$FA. With relatively small $n$, e.g. when $n=25$, PCA performs [expectedly] worse, but the difference is not that strong for small $k$, and even for $k=5$ the ratio is below $1.2$.

The ratio can become large when the number of factors $k$ becomes comparable with the number of variables $n$. In the example I gave above with $n=2$ and $k=1$, FA achieves $0$ reconstruction error, whereas PCA does not, i.e. the ratio would be infinite. But getting back to the original question, when $n=21$ and $k=3$, PCA will only moderately lose to FA in explaining the off-diagonal part of $\mathbf C$.

For an illustrated example of PCA and FA applied to a real dataset (wine dataset with $n=13$), see my answers here: