Solved – Box Cox Transformation with shift

data transformation, r, regression, regression-strategies

I am trying to do a Box-Cox transformation with a shift. My dependent variable, annual foreign sales of companies (in US\$ thousands), contains zeros and comes from a panel data set. I have been advised to add a small amount, for example 0.00001, to the annual foreign sales figures so that I can take the log, but I think a Box-Cox transformation will produce a more appropriate constant than 0.00001. I ran a Box-Cox transformation in R with the code below, but it gave me a very large lambda2 of 31162.8.

library(geoR)
boxcoxfit(bornp$ForeignSales, lambda2 = TRUE)
#R output - Fitted parameters:
# lambda lambda2 beta sigmasq 
# -1.023463e+00 3.116280e+04 9.770577e-01 7.140328e-11

My hunch is that the above value of lambda2 is very large, so I am not sure if I need to run the boxcoxfit with my independent variables like below:

# covariates are passed as a matrix through the xmat argument
boxcoxfit(bornp$ForeignSales, xmat = cbind(bornp$family, bornp$roa, bornp$solvencyratio), lambda2 = TRUE)

I am still trying to identify the best set of independent variables, so I am not sure if using the boxcoxfit with independent variables at this stage will work or is best.

Here's the description of the two lambda parameters from the help:

lambda       numerical value(s) for the transformation parameter $\lambda$. Used as the initial value
             in the function for parameter estimation. If not provided default values are
             assumed. If multiple values are passed the one with highest likelihood is used as
             initial value.
lambda2      logical or numerical value(s) of the additional transformation (see DETAILS
             below). Defaults to NULL. If TRUE this parameter is also estimated and the initial
             value is set to the absolute value of the minimum data. A numerical value is
             provided it is used as the initial value. Multiple values are allowed as for
             lambda.

Best Answer

From the documentation of the R geoR package, the Box-Cox transform with the extra parameter $\lambda_2$ is defined as

$$ Y' = \begin{cases} \log(Y+\lambda_2) & \text{if } \lambda=0 \\ \dfrac{(Y+\lambda_2)^\lambda - 1}{\lambda} & \text{otherwise} \end{cases} $$

so if $\lambda_2=0$ this is the usual Box-Cox transform. The boxcoxfit function estimates the two parameters $\lambda, \lambda_2$ by maximum likelihood. From the examples on the help page it seems that boxcoxfit can take as its first argument either a vector of data values or a model object, in which case it presumably uses the residuals from the fit. From your code it seems you have given just a data vector, so the transformation parameters are estimated from the marginal distribution of the response variable. That is usually less useful than basing them on the residuals from a model fit, so you should redo the analysis that way.
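To make the definition concrete, here is a minimal base R sketch of the shifted transform and of the profile log-likelihood that a maximum likelihood fit with covariates would maximise. It does not call geoR and is not necessarily how boxcoxfit is implemented. The function names `boxcox_shift` and `profile_loglik` and the simulated data are assumptions made for illustration only.

```r
# Two-parameter (shifted) Box-Cox transform as defined above.
# Every value of y + lambda2 must be strictly positive.
boxcox_shift <- function(y, lambda, lambda2 = 0) {
  z <- y + lambda2
  stopifnot(all(z > 0))
  if (abs(lambda) < 1e-8) log(z) else (z^lambda - 1) / lambda
}

# Profile log-likelihood (up to an additive constant) of (lambda, lambda2)
# for a linear model with design matrix X: regress the transformed response
# on X, then add the log-Jacobian of the transformation.
profile_loglik <- function(y, X, lambda, lambda2) {
  z <- boxcox_shift(y, lambda, lambda2)
  rss <- sum(lm.fit(X, z)$residuals^2)
  n <- length(y)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y + lambda2))
}

# Usage on simulated data (hypothetical, not the data from the question)
set.seed(42)
x <- runif(100)
X <- cbind(1, x)                        # intercept plus one covariate
y <- exp(2 + x + rnorm(100, sd = 0.2))  # positive response, log-linear in x
lambdas <- seq(-1, 1, by = 0.25)
ll <- sapply(lambdas, function(l) profile_loglik(y, X, l, lambda2 = 1))
data.frame(lambda = lambdas, loglik = ll)
```

Passing `X <- matrix(1, length(y), 1)` instead gives the marginal (intercept-only) version, which corresponds to fitting on the data vector alone.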

You then ask why $\lambda_2=\text{3.116280e+04}$ is so large, and whether that is reasonable. We cannot judge that, since you didn't tell us about the marginal distribution of your $Y$ variable. But you did tell us that $Y$ is annual foreign sales of companies (in US\$ thousands), which I would guess contains many large values, so in that context the estimated $\lambda_2$ may not be so large after all.
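One rough way to check this is to put the estimated shift next to a few summaries of the response. The sketch below is only an illustration: the helper name `shift_vs_data` and the example sales figures are hypothetical, and in practice the first argument would be the actual response, e.g. bornp$ForeignSales.

```r
# Compare an estimated shift with the scale of the response variable.
shift_vs_data <- function(y, lambda2) {
  c(median            = median(y),
    share_zero        = mean(y == 0),
    share_below_shift = mean(y < lambda2),
    shift_over_median = lambda2 / median(y))
}

# Hypothetical sales figures in US$ thousands (not the asker's data)
sales <- c(0, 0, 1500, 12000, 48000, 250000, 1300000)
res <- shift_vs_data(sales, lambda2 = 31162.8)  # lambda2 as reported in the question
res
```

If most observations are far above the shift, the estimate is small relative to the data even though it looks large next to 0.00001.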