Here is some quick R code to get you started. Major caveats: I wrote this myself, so it could be buggy, statistically incorrect, poorly styled... use at your own risk!
params_are_valid <- function(params) {
  stopifnot("n_hidden_states" %in% names(params)) # Number of hidden states
  stopifnot("lambdas" %in% names(params)) # Poisson parameters for each hidden state
  stopifnot(all(params$lambdas >= 0.0))
  stopifnot(length(params$lambdas) == params$n_hidden_states)
  stopifnot("mu" %in% names(params)) # Initial distribution over hidden states
  stopifnot(length(params$mu) == params$n_hidden_states)
  stopifnot(isTRUE(all.equal(sum(params$mu), 1.0))) # Probabilities sum to 1
  stopifnot(all(params$mu >= 0.0))
  stopifnot("P" %in% names(params)) # Transition probabilities for hidden state
  stopifnot(nrow(params$P) == params$n_hidden_states &&
            ncol(params$P) == params$n_hidden_states)
  stopifnot(isTRUE(all.equal(rowSums(params$P), rep(1, params$n_hidden_states)))) # Probabilities sum to 1
  stopifnot(all(params$P >= 0))
  return(TRUE)
}
baum_welch_poisson <- function(observed_y, params) {
  ## Baum-Welch algorithm for HMM with discrete hidden x
  ## Discrete observations y (NAs allowed) with y|x ~ Poisson(params$lambdas[x])
  ## Written following Ramon van Handel's HMM notes, page 40, algorithm 3.2
  ## https://www.princeton.edu/~rvan/orf557/hmm080728.pdf
  ## Careful, his observation index is in {0, 1, ..., n} while I use {1, 2, ..., length(observed_y)}
  y_length <- length(observed_y)
  stopifnot(y_length > 1)
  stopifnot(all(observed_y >= 0 | is.na(observed_y)))
  stopifnot(all(observed_y %% 1 == 0 | is.na(observed_y))) # Integer observations
  stopifnot(params_are_valid(params))
  c <- vector("numeric", y_length)
  if (is.na(observed_y[1])) {
    upsilon <- rep(1, params$n_hidden_states)
  } else {
    upsilon <- dpois(observed_y[1], lambda=params$lambdas)
  }
  stopifnot(length(upsilon) == params$n_hidden_states)
  c[1] <- sum(upsilon * params$mu)
  ## Matrix pi_contemporaneous gives probabilities over x_k conditional on {y_1, y_2, ..., y_k}
  ## Notation in van Handel's HMM notes is pi_k, whereas pi_{k|n} conditions on full history of y
  pi_contemporaneous <- matrix(NA, params$n_hidden_states, y_length)
  pi_contemporaneous[, 1] <- upsilon * params$mu / c[1]
  upsilon_list <- list()
  upsilon_list[[1]] <- upsilon
  for (k in seq(2, y_length)) {
    ## Forward loop
    if (is.na(observed_y[k])) {
      upsilon <- rep(1, params$n_hidden_states)
    } else {
      upsilon <- dpois(observed_y[k], lambda=params$lambdas)
    }
    upsilon_list[[k]] <- upsilon # Cache for backward loop
    pi_tilde <- upsilon * t(params$P) %*% pi_contemporaneous[, k-1]
    c[k] <- sum(pi_tilde)
    pi_contemporaneous[, k] <- pi_tilde / c[k]
  }
  beta <- matrix(NA, params$n_hidden_states, y_length)
  beta[, y_length] <- 1 / c[y_length]
  ## Matrix pi gives probabilities over x conditional on full history of y
  ## Notation in van Handel's HMM notes is pi_{k|n}, as opposed to pi_k
  pi <- matrix(NA, params$n_hidden_states, y_length)
  pi[, y_length] <- pi_contemporaneous[, y_length]
  pi_transition_list <- list() # List of posterior probabilities over hidden x transitions
  for (k in seq(1, y_length - 1)) {
    ## Backward loop
    upsilon <- diag(upsilon_list[[y_length - k + 1]],
                    params$n_hidden_states, params$n_hidden_states)
    pi_matrix <- diag(pi_contemporaneous[, y_length - k],
                      params$n_hidden_states, params$n_hidden_states)
    beta_matrix <- diag(beta[, y_length - k + 1],
                        params$n_hidden_states, params$n_hidden_states)
    beta[, y_length - k] <- params$P %*% upsilon %*% beta[, y_length - k + 1] / c[y_length - k]
    pi_transition_list[[y_length - k]] <- pi_matrix %*% params$P %*% upsilon %*% beta_matrix
    stopifnot(isTRUE(all.equal(sum(pi_transition_list[[y_length - k]]), 1.0)))
    pi[, y_length - k] <- rowSums(pi_transition_list[[y_length - k]])
  }
  loglik <- sum(log(c))
  return(list(loglik=loglik, pi=pi, pi_transition_list=pi_transition_list))
}
## Notice that params0$lambdas is zero for the first hidden state, i.e. the first hidden state represents a stockout
params0 <- list("n_hidden_states"=2,
                "mu"=c(0.10, 0.90), # Initial distribution over hidden states
                "P"=rbind(c(0.50, 0.50),
                          c(0.10, 0.90)), # Transition probabilities for hidden state
                "lambdas"=c(0, 2)) # Observe y|x ~ Poisson(lambdas[x])
observations <- c(1, 2, 3, NA, NA, 0, 0, 0, 5, 5, 0, 5, 6, 7, NA, 5, 8, 9, 0, 6)
bw_list <- baum_welch_poisson(observations, params0)
round(bw_list$pi, 3) # Posterior probabilities over hidden states given observations (first row is stockout state)
bw_list$pi[, which(observations != 0)] # Sanity check: we're certain we're in hidden state 2 whenever observations > 0
bw_list$pi[, which(observations == 0 |
                   is.na(observations))] # When observations are zero (or NA), we could be in either state
The code runs Baum-Welch on a simple two-state HMM with Poisson observations. In this case, I set the Poisson rate (lambda) to zero for the first hidden state (and to a positive value for the second), which makes the first hidden state a "stockout" state as in your example.
The example does not actually fit the parameters (estimate them from data) -- for that you would need to wrap the Baum-Welch output in an expectation-maximization (EM) loop.
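Here is a minimal sketch of what one EM update could look like on top of baum_welch_poisson. Same caveats as above, and then some: the helper name em_step_poisson is my own, I haven't tested it carefully, and if you want to keep the stockout state's rate pinned at zero you would skip the lambdas[1] update (see the commented line).
em_step_poisson <- function(observed_y, params) {
  ## One EM iteration: E-step via baum_welch_poisson, then a closed-form M-step
  bw <- baum_welch_poisson(observed_y, params)
  mu_new <- bw$pi[, 1] # Smoothed state probabilities at k = 1
  ## Expected transition counts, row-normalized into a stochastic matrix
  transition_counts <- Reduce(`+`, bw$pi_transition_list)
  P_new <- transition_counts / rowSums(transition_counts)
  ## Poisson rates: posterior-weighted mean of the non-missing observations
  observed <- !is.na(observed_y)
  weights <- bw$pi[, observed, drop=FALSE]
  lambdas_new <- as.vector(weights %*% observed_y[observed]) / rowSums(weights)
  ## lambdas_new[1] <- 0 # Uncomment to keep the stockout rate pinned at zero
  list(n_hidden_states=params$n_hidden_states,
       mu=mu_new, P=P_new, lambdas=lambdas_new)
}
## Iterate until the log-likelihood from baum_welch_poisson stops improving, e.g.:
params_fit <- params0
for (i in 1:25) params_fit <- em_step_poisson(observations, params_fit)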
Best Answer
This is not that difficult. First, transform the day of the week into 7 binary (0,1) features; for each record in your dataset (say, one day), exactly one of the 7 binary day variables will be 1, so for example Wednesday would be $x=\{0,0,0,1,0,0,0\}$. Next, transform the period of the month into a set of binary (0,1) features, setting to 1 whichever period a record falls in. Likewise, recode the event levels for that day into binary features, e.g. $x=\{1,0,0,0,0\}$. Initially, you can probably drop the temporal variable for the months the data cover. Then try linear regression with daily sales as the dependent variable and all the binary features as predictors, specifying that no constant (y-intercept) is to be generated (this is related to sum-to-zero constraints).
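To make the recoding concrete, here is a rough sketch in R; the data frame sales_df and its date and sales columns are hypothetical stand-ins for your actual data:
## Hypothetical daily data: one row per day, simulated only for illustration
set.seed(1)
sales_df <- data.frame(date=seq(as.Date("2020-01-01"), by="day", length.out=365),
                       sales=rpois(365, lambda=20))
## Day of week as a factor (note: weekdays() is locale-dependent)
sales_df$day_of_week <- factor(weekdays(sales_df$date))
## Period of the month as binary features; early/mid/late is one arbitrary choice
sales_df$month_period <- cut(as.integer(format(sales_df$date, "%d")),
                             breaks=c(0, 10, 20, 31),
                             labels=c("early", "mid", "late"))
## The `- 1` drops the intercept so each factor level gets its own 0/1 column
head(model.matrix(~ day_of_week - 1, data=sales_df)) # Exactly one 1 per row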
Also, run a histogram of daily sales amounts and get the skewness value, since sales are probably log-normally distributed with a right tail. If sales are right-skewed, take log_e(sales) for every day and run the histogram again; it should look more normally distributed. I would probably use log_e(sales) as the dependent variable, since there will be more days with lower sales than higher.
For now, use multiple linear regression to see which of the binary independents are significant predictors of log daily sales amounts.
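Continuing the hypothetical sales_df from above, the skewness check, log transform, and no-intercept regression might look like this (the +1 inside the log is my addition, to guard against zero-sales days):
hist(sales_df$sales) # A long right tail suggests a log transform
## skewness() is not in base R; e1071::skewness or moments::skewness both work
sales_df$log_sales <- log(sales_df$sales + 1) # Drop the +1 if sales are strictly positive
hist(sales_df$log_sales) # Should look closer to normal
## Multiple regression with no constant: `- 1` suppresses the intercept
fit <- lm(log_sales ~ day_of_week + month_period - 1, data=sales_df)
summary(fit) # Check which binary predictors are significant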
AI is overkill initially, since you first need to know how to recode your data to fit the problem. Have you thought about recoding the output into low/high (0,1) sales days? Or even assigning each day to a quartile, recoding each daily sales amount into $y=\{0,0,0,1\}$ based on cutpoints at the 25th, 50th, and 75th percentiles of the distribution of log_e(sales)?
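For instance, still assuming the hypothetical sales_df, the quartile recoding could use quantile() and cut():
## Cutpoints at the 25th, 50th, and 75th percentiles of log sales
cutpoints <- quantile(sales_df$log_sales, probs=c(0.25, 0.50, 0.75))
sales_df$sales_quartile <- cut(sales_df$log_sales,
                               breaks=c(-Inf, cutpoints, Inf),
                               labels=c("Q1", "Q2", "Q3", "Q4"))
## Careful: cut() errors on duplicated breaks, so with many tied values
## (e.g. lots of zero-sales days) deduplicate the cutpoints first
table(sales_df$sales_quartile) # Roughly a quarter of the days in each class
model.matrix(~ sales_quartile - 1, data=sales_df)[1:5, ] # The binary y = {0,0,0,1} coding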
You shouldn't want an AI "sausage machine" that predicts daily sales before you have mastered data transformations and recoding. Using AI will also be a problem if any features are correlated. You may also run into a problem that requires LHS (Latin hypercube sampling) for your continuous predictors and target, since neural networks can fail miserably if the proportion of data over the range of each continuous input-output feature is jumpy (sparse in some regions, dense in others).