How to deal with an independent variable of value 0 when applying a log-log model

data transformation, multiple regression, r, regression

I am trying to apply a log-log specification in a marketing mix model. The dependent variable is sales. Among the independent variables, one is holiday, a dummy variable whose value is either 1 (yes) or 0 (no). Another is competitor spend, whose value is sometimes 0.

I want to use a log-log model in R, but how can I deal with variables like holiday and competitor spend, which can take the value 0? I cannot remove these two variables from the model.

I already fit a linear regression model in R. Can anyone help me with the log-log model? (As suggested in a comment below, I used log(x+1) in the log-log model. Is that correct?)

model <- lm(SALES ~ HOLIDAY + AVERAGE_PRICE + COMPETITOR_MEDIA_SPEND
        + IMP_TV + IMP_EMAIL + IMP_PAID_SEARCH + IMP_ONLINE_DISPLAY
        + IMP_PRODUCT_SEARCH, data = mmm)


model2 <- lm(log(SALES) ~ log(HOLIDAY + 1) + log(AVERAGE_PRICE)
         + log(COMPETITOR_MEDIA_SPEND + 1) + log(IMP_TV) + log(IMP_EMAIL)
         + log(IMP_PAID_SEARCH) + log(IMP_ONLINE_DISPLAY)
         + log(IMP_PRODUCT_SEARCH), data = mmm)

Best Answer

If $\mathit{HOLIDAY}_t$ is a binary indicator variable, then there's absolutely no reason to compute $X_t = \log(1 + \mathit{HOLIDAY}_t)$. Just put the indicator $\mathit{HOLIDAY}_t$ itself on the right-hand side of the regression.

Table of values:

$$\begin{array}{ccc} \text{Is day $t$ a holiday?} & \mathit{HOLIDAY}_t & \log(1 + \mathit{HOLIDAY}_t) \\ \text{no} & 0 & 0 \\ \text{yes} & 1 & \log(2) \end{array}$$

You're basically creating an indicator variable that takes the value $\log(2)$ (which is $\approx 0.6931$) if the condition is true and 0 otherwise. IMHO, this is bizarre.

Two equivalent regressions:

A regression of: $$y_t = a + b_1 \mathit{HOLIDAY}_t + \epsilon_t$$

is equivalent to a regression of: $$y_t = a + b_2 \log(1+\mathit{HOLIDAY}_t) + \epsilon_t$$

in the sense that the fitted values are identical and your estimates satisfy $\hat{b}_2 = \frac{\hat{b}_1}{\log(2)}$, since $\log(1 + \mathit{HOLIDAY}_t) = \log(2)\,\mathit{HOLIDAY}_t$ whenever $\mathit{HOLIDAY}_t$ is 0 or 1.
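If you want to convince yourself numerically, here is a minimal sketch on simulated data. Everything in it (the sample size, the 0.02 holiday effect, the variable names `holiday`, `y`, `fit_raw`, `fit_log`) is made up for illustration and has nothing to do with the actual `mmm` data.

```r
# Simulated data; all names and numbers are hypothetical.
set.seed(1)
n <- 200
holiday <- rbinom(n, size = 1, prob = 0.2)      # 0/1 indicator
y <- 5 + 0.02 * holiday + rnorm(n, sd = 0.1)    # stand-in for log(sales)

fit_raw <- lm(y ~ holiday)             # indicator entered directly
fit_log <- lm(y ~ log(1 + holiday))    # same indicator, rescaled by log(2)

b1 <- unname(coef(fit_raw)[2])
b2 <- unname(coef(fit_log)[2])

# The slopes differ only by the factor log(2); the fitted values coincide.
all.equal(b2, b1 / log(2))
all.equal(fitted(fit_raw), fitted(fit_log))
```

Both comparisons should come back `TRUE` (up to floating-point tolerance): the transform changes the units of the coefficient and nothing else.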

Conclusion (weirdo transforms hurt interpretability)

If you ran the regression $\log(Sales_t) = a + b \, \mathit{HOLIDAY}_t + \epsilon_t$ and got an estimate for $b$ of 0.02, you would basically conclude that sales are about 2 percent higher on holidays (the exact figure is $e^{0.02} - 1$, and for small $b$ this is very close to $b$ itself).
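The "percent higher" reading of a dummy coefficient in a log-dependent-variable regression is an approximation that works well for small coefficients. A quick sketch of the exact conversion, where `pct_effect` is a hypothetical helper name of my own and not something from any package:

```r
# Hypothetical helper: exact percent change in y implied by a coefficient b
# on a 0/1 dummy when the dependent variable is log(y).
pct_effect <- function(b) 100 * (exp(b) - 1)

pct_effect(0.02)   # very close to 2
pct_effect(0.50)   # clearly more than 50, so the shortcut breaks down
```

For coefficients of the size you would expect from a holiday effect, reading $b$ directly as a percentage is usually fine.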

If you ran the regression $\log(Sales_t) = a + b \log(1 + \mathit{HOLIDAY}_t) + \epsilon_t$ instead, you would get an estimate of $0.02 / \log(2) \approx 0.0289$, which has absolutely no meaningful interpretation.
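In R terms, that means putting `HOLIDAY` into the formula untransformed. Below is a sketch on a simulated stand-in for `mmm`, since I don't have your data: the column names follow the question, but the values, the coefficients used to generate them, and the name `model3` are all assumptions of mine. I kept only a few regressors for brevity, and I left `log(COMPETITOR_MEDIA_SPEND + 1)` exactly as you wrote it, since the point here is only the dummy.

```r
# Simulated stand-in for the asker's data frame `mmm`; all values are made up.
set.seed(42)
n <- 156
mmm <- data.frame(
  HOLIDAY                = rbinom(n, size = 1, prob = 0.1),
  AVERAGE_PRICE          = runif(n, 8, 12),
  COMPETITOR_MEDIA_SPEND = ifelse(runif(n) < 0.3, 0, runif(n, 1e3, 1e4)),
  IMP_TV                 = runif(n, 1e5, 1e6)
)
# Hypothetical data-generating process with a 0.05 holiday effect on log(SALES).
mmm$SALES <- exp(10 + 0.05 * mmm$HOLIDAY
                 - 1.2  * log(mmm$AVERAGE_PRICE)
                 - 0.03 * log(mmm$COMPETITOR_MEDIA_SPEND + 1)
                 + 0.10 * log(mmm$IMP_TV)
                 + rnorm(n, sd = 0.05))

# HOLIDAY enters as a plain 0/1 dummy; no log transform is applied to it.
model3 <- lm(log(SALES) ~ HOLIDAY + log(AVERAGE_PRICE)
             + log(COMPETITOR_MEDIA_SPEND + 1) + log(IMP_TV), data = mmm)

b_holiday <- unname(coef(model3)["HOLIDAY"])
b_holiday                     # approximate proportional lift on holidays
100 * (exp(b_holiday) - 1)    # the same effect as an exact percentage
```

The coefficient on `HOLIDAY` can then be read directly as the approximate proportional change in sales on holidays, holding the other regressors fixed.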
