Solved – How to find how independent variables affect dependent variables

model selection, regression

I'm a bit of a novice at maths and am trying to get my head around a problem.

I have 3 independent variables which affect 1 dependent variable. I want to create a 4D model which will give me the 4th dimension when I give it an (x, y, z) triplet.

I am programming in Java and already have a regression function which will take a set of independent variables and give me coefficients of those independent variables which best fit the data supplied.

What I am trying to figure out is which independent variables to use.

I have tried various cubic functions, with independent variables something like:
{1 + x + y + z + xx + xy + xz + yy + yz + zz + xxx + xxy + xxz + yyy + yyx + yyz + zzz + zzy + zzx + xyz}
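As an illustration only, the 20 terms of a full cubic in three variables can be generated rather than typed by hand, which makes it easier to try other degrees. The sketch below assumes a hypothetical class `CubicTerms` with a `features` method (neither is from the original post) and assumes the existing regression function accepts one row of term values per data point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
public class CubicTerms {

    /**
     * Evaluates every monomial x^a * y^b * z^c with a + b + c <= maxDegree
     * at the point (x, y, z). For maxDegree = 3 this yields 20 values,
     * starting 1, x, y, z, xx, xy, xz, yy, yz, zz, ...
     */
    public static double[] features(double x, double y, double z, int maxDegree) {
        List<Double> terms = new ArrayList<>();
        for (int total = 0; total <= maxDegree; total++) {
            for (int a = total; a >= 0; a--) {
                for (int b = total - a; b >= 0; b--) {
                    int c = total - a - b;
                    // Math.pow(v, 0) is 1.0 for any v, so a zero exponent drops that variable.
                    terms.add(Math.pow(x, a) * Math.pow(y, b) * Math.pow(z, c));
                }
            }
        }
        double[] row = new double[terms.size()];
        for (int i = 0; i < row.length; i++) {
            row[i] = terms.get(i);
        }
        return row;
    }
}
```

Calling `CubicTerms.features(x, y, z, 3)` for each data point would give one row of the input matrix; dropping a term then amounts to removing one column.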

Then, when the resulting model looked a bit wrong, I thought: maybe the x and y values don't mean anything when multiplied together, so I took out the terms where x and y appear together. It still isn't right, and I'm worried I'm going about it in entirely the wrong way. Maybe there's an exponential in there somewhere?

Is there some mathematical method for finding exactly which variables I should be using?

Cheers!


The data I already have is like this. It's to do with calculating final scores in a cricket game based on which batsmen are in and how far through the game we are. The final score is the dependent variable:

X axis ranges from 0 to 10 inclusive (the order of the first batsman in).
Y axis ranges from 0 to 10 inclusive (the order of the second batsman in).
Z axis ranges from 0 to 19 inclusive (the over we are in, basically means how far through the game we are).

The lower x and y are, the higher the final score will be, as the team have better batsmen still in.
The higher the over (when x and y are the same), the higher the final score will be, because the batsmen have lasted longer and so should have their eye in.

I guess "what I expect the dependent variable to be" is exactly the question: how does each parameter affect the final score? I can post some sample data if you want.

I have calculated a data point for each (X, Y, Z) combination, so I can't get more. I have data from all the cricket games, and each data point is the average final score of the games in which that situation occurred. Some situations ((x, y, z) triples) occur far more often than others, so their averages are based on more games, and they have been weighted in the regression function accordingly.

Best Answer

There is no exact science behind including or omitting explanatory variables, because adding variables changes the meaning of your model. Once other variables are included, each coefficient answers the question: how does x affect y, keeping all other explanatory variables constant?

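Apart from paying attention to the p-values, you should also have a look at the $ R^2 $, which measures the proportion of the variance in the dependent variable that is explained by your model. Bear in mind that $ R^2 $ never decreases when you add terms, so the adjusted $ R^2 $ is the fairer figure when comparing models with different numbers of terms.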
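If you stay with your own Java regression code, $ R^2 $ is straightforward to compute from the observed values and the model's predictions. The following is a minimal, unweighted sketch; the class name `FitQuality` and its method names are made up for illustration, and `p` is assumed to count the fitted terms excluding the constant.

```java
// Hypothetical helper, not from any library.
public class FitQuality {

    /** R^2 = 1 - SS_res / SS_tot, using the plain (unweighted) mean of the observations. */
    public static double rSquared(double[] observed, double[] predicted) {
        double mean = 0.0;
        for (double v : observed) {
            mean += v;
        }
        mean /= observed.length;

        double ssRes = 0.0; // sum of squared residuals
        double ssTot = 0.0; // total sum of squares around the mean
        for (int i = 0; i < observed.length; i++) {
            double residual = observed[i] - predicted[i];
            double deviation = observed[i] - mean;
            ssRes += residual * residual;
            ssTot += deviation * deviation;
        }
        return 1.0 - ssRes / ssTot;
    }

    /**
     * Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
     * where n is the number of data points and p the number of terms
     * excluding the constant.
     */
    public static double adjustedRSquared(double r2, int n, int p) {
        return 1.0 - (1.0 - r2) * (n - 1) / (double) (n - p - 1);
    }
}
```

Comparing the adjusted value for the full cubic against a model with some terms removed gives a rough indication of whether the removed terms were earning their place. Since your data points are weighted, a weighted variant would be more appropriate in practice; this sketch ignores weights.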

Unless you are interested in coding the regression algorithm yourself, you might want to have a look at the excellent Weka Java library or at R, an open-source statistics package.
