Solved – What algorithm should I use to predict a continuous dependent variable from multiple continuous & categorical independent variables

information retrieval, machine learning, multiple regression, regression

I'm a software engineer at an e-commerce company, facing the following problem:

An e-commerce shop sells its products daily and wants to know which conditions might improve its sales. I'm building an AI sales predictor based on:

Categorical variables

  • week days (Mon, Tue, Wed,… Sun)
  • day period in a month (<10, 10<= … <= 20, >20)
  • event level of that day (A, B, C, S, R)

Continuous variables

  • number of months of data used for training (1, 2, 3, 4, …)

I'm looking for the best model to fit mixed independent variables like these. Are there any ideas or sites you could point me to?

Many thanks!!

Best Answer

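This is not that difficult. First, transform day of the week into 7 binary (0,1) features. For each record in your dataset (perhaps a day), only one of the 7 binary day variables will be 1; for example, with the columns ordered Sunday to Saturday, Wednesday would be $x=\{0,0,0,1,0,0,0\}$. Next, transform the day period of the month into a set of binary (0,1) features, and set to 1 the one that a record falls in. Next, code the event level for that day the same way, e.g. $x=\{1,0,0,0,0\}$ for the first level. Initially, probably drop your temporal variable on the number of months of training data. Then try linear regression with daily sales as the dependent variable and all the binary features as predictors. Also, specify that no constant (y-intercept) is to be generated, so that the full set of 7 day indicators can stay in the model. Because each complete set of indicators sums to 1 in every record, dropping the intercept removes only one of the redundancies: leave one reference level out of each of the other two sets, or the coefficients will not be uniquely determined. (This is reference-level coding; sum-to-zero constraints are a different way of resolving the same redundancy.)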
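The recoding described above can be sketched in a few lines of plain Python. This is only an illustration: the names `one_hot`, `month_period`, and `encode_record`, the labels "early"/"mid"/"late", and the Sunday-first column order are assumptions made for the example, not something given in the question.

```python
# Column orders are assumptions for this sketch; any fixed order works
# as long as it is used consistently for every record.
WEEKDAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
MONTH_PERIODS = ["early", "mid", "late"]   # day < 10, 10..20, > 20
EVENT_LEVELS = ["A", "B", "C", "S", "R"]


def one_hot(value, levels):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if level == value else 0 for level in levels]


def month_period(day_of_month):
    """Map a day of the month (1..31) to the three periods in the question."""
    if day_of_month < 10:
        return "early"
    if day_of_month <= 20:
        return "mid"
    return "late"


def encode_record(weekday, day_of_month, event_level):
    """Concatenate the three indicator blocks into one feature row."""
    return (one_hot(weekday, WEEKDAYS)
            + one_hot(month_period(day_of_month), MONTH_PERIODS)
            + one_hot(event_level, EVENT_LEVELS))


# A Wednesday on the 14th of the month with event level A.
row = encode_record("Wed", 14, "A")
print(row)
```

Each record becomes a row of 15 zeros and ones, with exactly one 1 inside each of the three blocks.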

Also, plot a histogram of daily sales amounts and compute the skewness, since sales are probably log-normally distributed with a right tail. When sales are right-skewed, take log_e(sales) for every day and plot the histogram again; it should look more normally distributed. I would probably use log_e(sales) as the dependent variable, as there will be more days with lower sales than higher.
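As a rough way to check this numerically rather than by eye, the sketch below computes the skewness before and after the log transform. The sales figures are made up for illustration, and `skewness` is a hypothetical helper using the simple population formula (third central moment divided by the variance to the power 1.5).

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up daily sales with two unusually strong days (illustrative only).
daily_sales = [100, 120, 150, 130, 110, 900, 140, 160, 125, 2000]
log_sales = [math.log(s) for s in daily_sales]  # requires every value > 0

raw_skew = skewness(daily_sales)
log_skew = skewness(log_sales)
print(f"skewness of sales:      {raw_skew:.2f}")
print(f"skewness of log(sales): {log_skew:.2f}")
```

A clearly positive skewness on the raw values that shrinks after the transform is the pattern described above.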

For now, use multiple linear regression to see which independent variables are significant predictors of log daily sales.
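A minimal sketch of such a fit with NumPy is shown below, on synthetic data. Everything numeric here is invented for the example: the "true" effects, the noise level, and the sample size. One assumption is worth stating loudly: to keep the design matrix full rank, the sketch keeps all 7 weekday indicators with no intercept but drops the first event level as a reference, so the event coefficients are differences from that level. The t-statistics are computed by hand from the usual OLS formulas.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable
n = 210                               # 30 synthetic weeks
weekday = np.arange(n) % 7            # index 0..6 into the weekday columns
event = rng.integers(0, 5, size=n)    # 0..4 standing for levels A, B, C, S, R

# Made-up "true" effects on log sales, used only to generate toy data.
true_weekday = np.array([6.0, 6.0, 6.1, 6.1, 6.3, 6.6, 6.5])
true_event = np.array([0.0, 0.2, 0.4, 0.8, 1.0])   # relative to level A
log_sales = true_weekday[weekday] + true_event[event] + rng.normal(0, 0.1, n)

# Design matrix: all 7 weekday indicators (this replaces the intercept)
# plus the event indicators with the first level dropped as reference.
X = np.hstack([np.eye(7)[weekday], np.eye(5)[event][:, 1:]])

# Ordinary least squares fit.
beta, _, rank, _ = np.linalg.lstsq(X, log_sales, rcond=None)

# Standard errors and t-statistics from the usual OLS formulas.
resid = log_sales - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se

names = [f"weekday_{i}" for i in range(7)] + ["event_B", "event_C", "event_S", "event_R"]
for name, b, t in zip(names, beta, t_stats):
    print(f"{name:10s} coef={b:6.3f}  t={t:8.2f}")
```

As a rule of thumb, an absolute t-statistic well above 2 suggests the coefficient differs from zero. For the weekday columns that only says mean log sales is not zero, so comparing weekday coefficients with each other is more informative than their individual t-values.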

AI is overkill initially, since you first need to know how to recode your data to solve the problem. Have you thought about recoding the output into low/high (0,1) sales days? Or assigning each day to a sales quartile and recoding each daily sales amount as, e.g., $y=\{0,0,0,1\}$ for the top quartile, based on cutpoints at the 25th, 50th, and 75th percentiles of the distribution of log_e(sales)?
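Both recodings can be sketched with the standard library alone. The sales figures are made up, `quartile_one_hot` is a hypothetical helper, and the cutpoints come from `statistics.quantiles` with its default method, which is one of several reasonable percentile definitions.

```python
import bisect
import math
import statistics

# Made-up daily sales (illustrative only).
daily_sales = [100, 120, 150, 130, 110, 900, 140, 160, 125, 2000]
log_sales = [math.log(s) for s in daily_sales]

# Three cutpoints: the 25th, 50th, and 75th percentiles of log sales.
cutpoints = statistics.quantiles(log_sales, n=4)
median = cutpoints[1]


def quartile_one_hot(value, cuts):
    """Return a 0/1 list with a 1 in the position of the value's quartile."""
    idx = bisect.bisect_right(cuts, value)   # 0 = lowest, 3 = highest
    return [1 if i == idx else 0 for i in range(len(cuts) + 1)]


# Binary recoding: 1 for a high-sales day, 0 for a low-sales day.
high_low = [1 if v > median else 0 for v in log_sales]

# Quartile recoding: one indicator per quartile.
quartiles = [quartile_one_hot(v, cutpoints) for v in log_sales]

for sales, hl, q in zip(daily_sales, high_low, quartiles):
    print(sales, hl, q)
```

With a binary or quartile target, the problem becomes classification rather than regression, so a method such as logistic regression would replace the linear model.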

You can't expect an AI "sausage machine" to predict daily sales before you master data transformations and recoding. Using AI is also going to be a problem if any features are correlated. You may also run into a problem that requires LHS (Latin hypercube sampling) for your continuous predictors and target, since neural networks can fail miserably if the density of the data over the range of each continuous input or output feature is uneven (sparse in some regions, dense in others).
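One simple way to see whether a continuous feature is unevenly covered is to count records in equal-width bins across its range. The sketch below does only that check; it is not Latin hypercube sampling itself. The helper `bin_counts`, the choice of 5 bins, and the sales figures are all assumptions for illustration.

```python
def bin_counts(values, n_bins=10):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; no range to bin")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up daily sales (illustrative only).
daily_sales = [100, 120, 150, 130, 110, 900, 140, 160, 125, 2000]

counts = bin_counts(daily_sales, n_bins=5)
empty_bins = [i for i, c in enumerate(counts) if c == 0]
print("records per bin:", counts)
print("empty bins:", empty_bins)
```

Many records crowded into one bin with empty bins elsewhere is the sparse/dense pattern described above.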
