Time Series – PCA for Selecting Stocks to Mimic an Index

clustering, pca, time series

I am looking for a way to select a subset of stocks whose returns approximately mimic the return of the 'index' they belong to.

These are all self-created factor portfolios, not traded indices, but each portfolio consists of too many stocks for me to replicate in a total portfolio. One approach would of course be to limit the number of stocks selected in the factor research process, but I would rather find another way to construct the final factor portfolios.

I was looking at PCA because of its "dimensionality reduction" feature. I (sort of) understand the math, but I struggle to understand the outputs and how to use them in further analysis.

When I use stock returns for N stocks over K days, I end up with eigenvectors of length N.
I just use the one with the highest eigenvalue, but what I don't understand is how this reduces the dimensions. If I transform the eigenvector into weights, I still get a weight for each stock, so I am no better off than if I just used all the stocks.

I am not even sure that transforming the eigenvector into stock weights makes sense, but I do it like this:

weights = abs(pca_components['pc1'])/sum(abs(pca_components['pc1']))

Also, how should I interpret the eigenvector with the second-highest eigenvalue in terms of variance explained? If I transform that into weights, I end up with one more portfolio whose weights differ from the first, and combining the two would just give a third portfolio, which probably doesn't make sense.

Instead of applying weights like this, I could also plot the data, look for clusters, manually pick a few random stocks from each cluster, and check how they did compared to the index, but some other method must exist.

And, are there any better statistical methods to achieve my goal?

Best Answer

You are right in thinking that each of the principal components is a linear combination of all the stocks. If I understand you correctly, what you are after is a subset of stocks from each portfolio that will mimic the performance of the full portfolio. If that is the case, you might consider a regression where the response $y$ is the value (or return) of the whole portfolio and the regressors are the values (or returns) of the individual stocks, and try to select a small number of regressors.
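To make the setup concrete, one way to write it down is the following, reusing N (number of stocks) and K (number of days) from the question. The symbols $r_{i,t}$ (return of stock $i$ on day $t$), $w_i$ (its weight), $S$ (the selected subset) and $m$ (its size) are my own notation, not something taken from your post:

$$
y_t = \sum_{i \in S} w_i \, r_{i,t} + \varepsilon_t, \qquad t = 1, \dots, K, \qquad S \subset \{1, \dots, N\}, \quad |S| = m \ll N
$$

Here $\varepsilon_t$ is the tracking error on day $t$. The task is then to choose $S$ and the weights $w_i$ so that the tracking error is small while $m$ stays manageable.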

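For the selection of regressors you might turn to stepwise regression, all-subsets regression or the lasso. The first option (stepwise regression) has been subject to considerable criticism, mainly because the selected set tends to be unstable and the coefficient estimates and p-values it reports after selection are biased. All-subsets regression becomes computationally expensive as the number of candidate stocks grows, since the number of possible subsets is $2^N$.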
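As a rough illustration of the lasso route, here is a minimal sketch on simulated data. Everything in it is an assumption made for the example: the one-factor return simulation, the equal-weighted "index", the names `stock_returns`, `index_returns` and `lasso_coordinate_descent`, and the penalty chosen as half of the smallest value that would zero out every weight. The lasso is written out with plain NumPy coordinate descent to keep the example self-contained; in practice you would more likely use a library implementation such as scikit-learn's `Lasso` and pick the penalty by cross-validation.

```python
import numpy as np


def soft_threshold(x, lam):
    """Shrink x towards zero by lam, returning exactly 0 if |x| <= lam."""
    return np.sign(x) * max(abs(x) - lam, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - X w||^2 + alpha * ||w||_1 (no intercept).

    Cyclic coordinate descent; assumes returns are roughly mean-zero.
    """
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # per-column mean of squares
    residual = y.copy()                 # equals y - X @ w while w == 0
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of column j with the residual that excludes stock j
            rho = X[:, j] @ residual / n + col_sq[j] * w[j]
            w_new = soft_threshold(rho, alpha) / col_sq[j]
            residual += X[:, j] * (w[j] - w_new)  # keep residual in sync
            w[j] = w_new
    return w


# --- Simulated data (hypothetical): one market factor plus stock noise ---
rng = np.random.default_rng(0)
n_days, n_stocks = 500, 30
market = rng.normal(0.0, 0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
noise = rng.normal(0.0, 0.01, size=(n_days, n_stocks))
stock_returns = market[:, None] * betas[None, :] + noise
index_returns = stock_returns.mean(axis=1)  # equal-weighted stand-in index

# Smallest penalty at which all weights are zero; use half of it here
alpha_max = np.max(np.abs(stock_returns.T @ index_returns)) / n_days
w_lasso = lasso_coordinate_descent(stock_returns, index_returns, 0.5 * alpha_max)

# Stocks with a non-zero lasso weight form the replicating subset
selected = np.flatnonzero(w_lasso)

# Lasso weights are shrunk, so refit ordinary least squares on the subset
weights, *_ = np.linalg.lstsq(stock_returns[:, selected], index_returns, rcond=None)
replica = stock_returns[:, selected] @ weights
tracking_corr = np.corrcoef(replica, index_returns)[0, 1]

print("stocks selected:", len(selected), "of", n_stocks)
print("in-sample correlation with index:", round(tracking_corr, 3))
```

Raising the penalty gives fewer stocks and a looser fit, lowering it gives more stocks and a tighter fit, so the penalty is effectively the knob for how many names you are willing to hold. The correlation reported here is in-sample; you would want to check tracking on data not used for the selection.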