Here is some quick R code to get you started. Major caveats: I wrote this myself, so it could be buggy, statistically incorrect, poorly styled... use at your own risk!
params_are_valid <- function(params) {
  stopifnot("n_hidden_states" %in% names(params)) # Number of hidden states
  stopifnot("lambdas" %in% names(params)) # Poisson parameters for each hidden state
  stopifnot(all(params$lambdas >= 0.0))
  stopifnot(length(params$lambdas) == params$n_hidden_states)
  stopifnot("mu" %in% names(params)) # Initial distribution over hidden states
  stopifnot(length(params$mu) == params$n_hidden_states)
  stopifnot(isTRUE(all.equal(sum(params$mu), 1.0))) # Probabilities sum to 1
  stopifnot(all(params$mu >= 0.0))
  stopifnot("P" %in% names(params)) # Transition probabilities for hidden state
  stopifnot(nrow(params$P) == params$n_hidden_states &&
            ncol(params$P) == params$n_hidden_states)
  stopifnot(isTRUE(all.equal(rowSums(params$P), rep(1, params$n_hidden_states)))) # Probabilities sum to 1
  stopifnot(all(params$P >= 0))
  return(TRUE)
}
baum_welch_poisson <- function(observed_y, params) {
  ## Baum-Welch algorithm for HMM with discrete hidden x
  ## Discrete observations y (NAs allowed) with y|x ~ Poisson(params$lambdas[x])
  ## Written following Ramon van Handel's HMM notes, page 40, algorithm 3.2
  ## https://www.princeton.edu/~rvan/orf557/hmm080728.pdf
  ## Careful, his observation index is in {0, 1, ..., n} while I use {1, 2, ..., length(observed_y)}
  y_length <- length(observed_y)
  stopifnot(y_length > 1)
  stopifnot(all(observed_y >= 0 | is.na(observed_y)))
  stopifnot(all(observed_y %% 1 == 0 | is.na(observed_y))) # Integer observations
  stopifnot(params_are_valid(params))
  c <- vector("numeric", y_length)
  if (is.na(observed_y[1])) {
    upsilon <- rep(1, params$n_hidden_states)
  } else {
    upsilon <- dpois(observed_y[1], lambda=params$lambdas)
  }
  stopifnot(length(upsilon) == params$n_hidden_states)
  c[1] <- sum(upsilon * params$mu)
  ## Matrix pi_contemporaneous gives probabilities over x_k conditional on {y_1, y_2, ..., y_k}
  ## Notation in van Handel's HMM notes is pi_k, whereas pi_{k|n} conditions on full history of y
  pi_contemporaneous <- matrix(NA, params$n_hidden_states, y_length)
  pi_contemporaneous[, 1] <- upsilon * params$mu / c[1]
  upsilon_list <- list()
  upsilon_list[[1]] <- upsilon
  for (k in seq(2, y_length)) {
    ## Forward loop
    if (is.na(observed_y[k])) {
      upsilon <- rep(1, params$n_hidden_states)
    } else {
      upsilon <- dpois(observed_y[k], lambda=params$lambdas)
    }
    upsilon_list[[k]] <- upsilon # Cache for backward loop
    pi_tilde <- upsilon * t(params$P) %*% pi_contemporaneous[, k-1]
    c[k] <- sum(pi_tilde)
    pi_contemporaneous[, k] <- pi_tilde / c[k]
  }
  beta <- matrix(NA, params$n_hidden_states, y_length)
  beta[, y_length] <- 1 / c[y_length]
  ## Matrix pi gives probabilities over x conditional on full history of y
  ## Notation in van Handel's HMM notes is pi_{k|n}, as opposed to pi_k
  pi <- matrix(NA, params$n_hidden_states, y_length)
  pi[, y_length] <- pi_contemporaneous[, y_length]
  pi_transition_list <- list() # List of posterior probabilities over hidden x transitions
  for (k in seq(1, y_length - 1)) {
    ## Backward loop
    upsilon <- diag(upsilon_list[[y_length - k + 1]],
                    params$n_hidden_states, params$n_hidden_states)
    pi_matrix <- diag(pi_contemporaneous[, y_length - k],
                      params$n_hidden_states, params$n_hidden_states)
    beta_matrix <- diag(beta[, y_length - k + 1],
                        params$n_hidden_states, params$n_hidden_states)
    beta[, y_length - k] <- params$P %*% upsilon %*% beta[, y_length - k + 1] / c[y_length - k]
    pi_transition_list[[y_length - k]] <- pi_matrix %*% params$P %*% upsilon %*% beta_matrix
    stopifnot(isTRUE(all.equal(sum(pi_transition_list[[y_length - k]]), 1.0)))
    pi[, y_length - k] <- rowSums(pi_transition_list[[y_length - k]])
  }
  loglik <- sum(log(c))
  return(list(loglik=loglik, pi=pi, pi_transition_list=pi_transition_list))
}
## Notice that params0$lambdas is zero for the first hidden state, i.e. the first hidden state represents a stockout
params0 <- list("n_hidden_states"=2,
                "mu"=c(0.10, 0.90), # Initial distribution over hidden states
                "P"=rbind(c(0.50, 0.50),
                          c(0.10, 0.90)), # Transition probabilities for hidden state
                "lambdas"=c(0, 2)) # Observe y|x ~ Poisson(lambdas[x])
observations <- c(1, 2, 3, NA, NA, 0, 0, 0, 5, 5, 0, 5, 6, 7, NA, 5, 8, 9, 0, 6)
bw_list <- baum_welch_poisson(observations, params0)
round(bw_list$pi, 3) # Posterior probabilities over hidden states given observations (first row is stockout state)
bw_list$pi[, which(observations != 0)] # Sanity check: we're certain we're in hidden state 2 whenever observations > 0
bw_list$pi[, which(observations == 0 |
                   is.na(observations))] # When observations are zero (or NA), we could be in either state
The code runs Baum-Welch on a simple two-state HMM with Poisson observations. In this case, I set the Poisson rate (lambda) to zero for the first hidden state (and to a positive value for the second), which makes the first hidden state a "stockout" state as in your example.
The example does not actually fit the parameters (estimate them from data) -- for that you would need to wrap the Baum-Welch output in an expectation-maximization (EM) loop.
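Here is a minimal sketch of what one EM update could look like on top of baum_welch_poisson. Same caveats as above, and then some: the helper name em_step_poisson is my own, I haven't tested it carefully, and if you want to keep the stockout state's rate pinned at zero you would skip the lambdas[1] update (see the commented line).
em_step_poisson <- function(observed_y, params) {
  ## One EM iteration: E-step via baum_welch_poisson, then a closed-form M-step
  bw <- baum_welch_poisson(observed_y, params)
  mu_new <- bw$pi[, 1] # Smoothed state probabilities at k = 1
  ## Expected transition counts, row-normalized into a stochastic matrix
  transition_counts <- Reduce(`+`, bw$pi_transition_list)
  P_new <- transition_counts / rowSums(transition_counts)
  ## Poisson rates: posterior-weighted mean of the non-missing observations
  observed <- !is.na(observed_y)
  weights <- bw$pi[, observed, drop=FALSE]
  lambdas_new <- as.vector(weights %*% observed_y[observed]) / rowSums(weights)
  ## lambdas_new[1] <- 0 # Uncomment to keep the stockout rate pinned at zero
  list(n_hidden_states=params$n_hidden_states,
       mu=mu_new, P=P_new, lambdas=lambdas_new)
}
## Iterate until the log-likelihood from baum_welch_poisson stops improving, e.g.:
params_fit <- params0
for (i in 1:25) params_fit <- em_step_poisson(observations, params_fit)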
Best Answer
This is not that difficult. First, transform the day of the week into 7 binary (0,1) features; for each record in your dataset (say, one day), exactly one of the 7 binary day variables will be 1, so for example Wednesday would be $x=\{0,0,0,1,0,0,0\}$. Next, transform the period of the month into a set of binary (0,1) features, setting to 1 whichever period a record falls in. Likewise, recode the event levels for that day into binary features, e.g. $x=\{1,0,0,0,0\}$. Initially, you can probably drop the temporal variable for the months the data cover. Then try linear regression with daily sales as the dependent variable and all the binary features as predictors, specifying that no constant (y-intercept) is to be generated (this is related to sum-to-zero constraints).
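To make the recoding concrete, here is a rough sketch in R; the data frame sales_df and its date and sales columns are hypothetical stand-ins for your actual data:
## Hypothetical daily data: one row per day, simulated only for illustration
set.seed(1)
sales_df <- data.frame(date=seq(as.Date("2020-01-01"), by="day", length.out=365),
                       sales=rpois(365, lambda=20))
## Day of week as a factor (note: weekdays() is locale-dependent)
sales_df$day_of_week <- factor(weekdays(sales_df$date))
## Period of the month as binary features; early/mid/late is one arbitrary choice
sales_df$month_period <- cut(as.integer(format(sales_df$date, "%d")),
                             breaks=c(0, 10, 20, 31),
                             labels=c("early", "mid", "late"))
## The `- 1` drops the intercept so each factor level gets its own 0/1 column
head(model.matrix(~ day_of_week - 1, data=sales_df)) # Exactly one 1 per row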
Also, run a histogram of daily sales amounts and get the skewness value, since sales are probably log-normally distributed with a right tail. If sales are right-skewed, take log_e(sales) for every day and run the histogram again; it should look more normally distributed. I would probably use log_e(sales) as the dependent variable, since there will be more days with lower sales than higher.
For now, use multiple linear regression to see which of the binary independents are significant predictors of log daily sales amounts.
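Continuing the hypothetical sales_df from above, the skewness check, log transform, and no-intercept regression might look like this (the +1 inside the log is my addition, to guard against zero-sales days):
hist(sales_df$sales) # A long right tail suggests a log transform
## skewness() is not in base R; e1071::skewness or moments::skewness both work
sales_df$log_sales <- log(sales_df$sales + 1) # Drop the +1 if sales are strictly positive
hist(sales_df$log_sales) # Should look closer to normal
## Multiple regression with no constant: `- 1` suppresses the intercept
fit <- lm(log_sales ~ day_of_week + month_period - 1, data=sales_df)
summary(fit) # Check which binary predictors are significant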
AI is overkill initially, since you first need to know how to recode your data to fit the problem. Have you thought about recoding the output into low/high (0,1) sales days? Or even assigning each day to a quartile, recoding each daily sales amount into $y=\{0,0,0,1\}$ based on cutpoints at the 25th, 50th, and 75th percentiles of the distribution of log_e(sales)?
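For instance, still assuming the hypothetical sales_df, the quartile recoding could use quantile() and cut():
## Cutpoints at the 25th, 50th, and 75th percentiles of log sales
cutpoints <- quantile(sales_df$log_sales, probs=c(0.25, 0.50, 0.75))
sales_df$sales_quartile <- cut(sales_df$log_sales,
                               breaks=c(-Inf, cutpoints, Inf),
                               labels=c("Q1", "Q2", "Q3", "Q4"))
## Careful: cut() errors on duplicated breaks, so with many tied values
## (e.g. lots of zero-sales days) deduplicate the cutpoints first
table(sales_df$sales_quartile) # Roughly a quarter of the days in each class
model.matrix(~ sales_quartile - 1, data=sales_df)[1:5, ] # The binary y = {0,0,0,1} coding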
You shouldn't want an AI "sausage machine" that predicts daily sales before you have mastered data transformations and recoding. Using AI will also be a problem if any features are correlated. You may also run into a problem that requires LHS (Latin hypercube sampling) for your continuous predictors and target, since neural networks can fail miserably if the proportion of data over the range of each continuous input-output feature is jumpy (sparse in some regions, dense in others).