Solved – Curve fitting multivariate data for maximal correlation with univariate data

curve fitting, genetic algorithms, multivariate analysis

I have multivariate time series data of the EURUSD financial vehicle. In this data each variable represents a different metric. There are ~200,000 rows and ~20 variables.
There are no NULL values for any variable at any row. All data is numerical.

Alongside this data, at each time point I have the univariate data "Profit."

I want to curve-fit a function that transforms my multivariate data set into a new univariate data set which has the MAXIMAL correlation with my "Profit" variable.

In other words, I want to iterate through different mathematical transformations of my multivariate data set until I find the one that is optimally correlated with my "Profit" data.

What is the best way to do this?
From what I understand, a genetic algorithm should work well.

Best Answer

The traditional approach to this sort of problem is:

  • if you have a theoretical reason for a relationship between your explanatory variables and your response (profit), then base a model on that, and test it rigorously...
  • if you don't, then look at the 20 plots with each of your variables on the horizontal axis and the response (profit) on the vertical axis, and look for obvious relationships, or transformations (logarithm normally the first one) that make relationships reasonably straightforward - if not linear, at least easily approximated by splines or locally linear regressions (see StasK's comment)
  • then, create a set of plausible linear models with profit as your response and your transformed or splined (if that's a word) variables as explanatory variables. Compare the models against some criterion of goodness of fit, e.g. AIC or BIC (plenty of debate on which to use). Be careful to make your significance tests more conservative (adjusted p-values get larger or, equivalently, the significance threshold gets smaller) to allow for the fact that you have implicitly looked at up to 2^20 different models.
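
As a rough illustration of the last point, here is a minimal sketch in Python (the answer suggests R; Python is used here only for the example). It fits profit against one raw variable and against its logarithm, then compares the two fits by AIC. The data are synthetic, and the names `fit_simple_ols`, `gaussian_aic`, `metric` and `profit` are made up for the example; the AIC formula assumes Gaussian errors and drops a constant that is the same for every model.

```python
import math
import random


def fit_simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, rss)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, rss


def gaussian_aic(rss, n, k):
    """AIC under Gaussian errors, up to a constant shared by all models."""
    return n * math.log(rss / n) + 2 * k


# Synthetic data (NOT the EURUSD data): the response is built to depend
# on log(metric) plus noise, so the log model should win.
rng = random.Random(0)
metric = [rng.uniform(1.0, 100.0) for _ in range(500)]
profit = [2.0 * math.log(m) + rng.gauss(0.0, 0.3) for m in metric]

# Candidate transformations of the single explanatory variable.
candidates = {
    "raw": metric,
    "log": [math.log(m) for m in metric],
}

aic = {}
for name, x in candidates.items():
    _, _, rss = fit_simple_ols(x, profit)
    # k = 3: intercept, slope and error variance.
    aic[name] = gaussian_aic(rss, len(profit), k=3)

best = min(aic, key=aic.get)  # lower AIC is better
print(aic, "-> preferred:", best)
```

With 20 variables and several transformations each, the same loop would run over many more candidates, which is exactly where the warning about implicitly testing a very large number of models applies.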

Unfortunately, any of those dot points above could be a major chapter or book. R can do anything necessary. I'd use plots rather than correlation coefficients, and read some of the large literature on model selection and fitting.
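
One reason to prefer plots over correlation coefficients: a Pearson correlation only measures linear association, so it can be close to zero even when the response is completely determined by the variable. The sketch below (Python, synthetic data, with a hand-written `pearson` helper that is just an assumption of this example) shows a quadratic relationship that the raw correlation misses and a plot would reveal at once.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Synthetic example: the response is exactly x squared, on a range
# that is symmetric around zero.
x = [float(i) for i in range(-10, 11)]
response = [xi ** 2 for xi in x]

r_raw = pearson(x, response)                           # linear association with x
r_transformed = pearson([xi ** 2 for xi in x], response)  # after squaring x

print("correlation with x:   ", r_raw)
print("correlation with x^2: ", r_transformed)
```

The raw correlation comes out at essentially zero while the transformed variable correlates perfectly, which is the kind of structure a scatter plot shows directly.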
