How to pre-process data for partial least squares (PLS) regression in R

Tags: lognormal distribution, normal distribution, partial least squares, r, standardization

I have a data frame that consists of 20 observations and 35 variables.

I want to prepare the data for partial least squares (PLS) regression in R.

Many authors suggest:

  1. Check whether the variables are normally distributed or not

  2. log-transform variables that are not normally distributed

  3. center data

  4. scale data (standardize data)

I checked the normality of the variables using the Shapiro-Wilk test and then log-transformed the variables that were not normally distributed.

My questions are:

  1. Should I standardize the log-transformed data or the original dataset?

  2. Is there an R package that pre-processes data for PLS?

Best Answer

Standardize the log-transformed data, if you are going to transform, not the original dataset. Standardizing (also called auto-scaling) gives each variable mean zero and variance 1 before it goes into PLSR. Centering removes the intercept from the model (the fit passes through the origin of the centered data), and scaling to unit variance makes PLSR initially weight all variables equally. Auto-scaling can amplify noise, however, because low-variance variables that are mostly noise get inflated to the same weight as informative ones, so it is not always a good option. There are other types of scaling that are compromises between auto-scaling and just centering, such as Pareto scaling, where you center each variable and divide it by the square root of its standard deviation. If you do not scale at all, larger-magnitude variables will be weighted more heavily in PLSR than lower-magnitude variables.
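As a minimal sketch of the order of operations described above (transform first, then scale), using base R only. The data are simulated stand-ins for the 20 x 35 data frame, the 0.05 Shapiro-Wilk cutoff is an arbitrary illustrative choice, and the object names (`X`, `X_log`, `X_auto`, `X_pareto`) are hypothetical. It also assumes all values are strictly positive, which the log transform requires.

```r
# Simulated stand-in for the 20 x 35 data frame in the question (not real data).
set.seed(1)
X <- as.data.frame(matrix(rlnorm(20 * 35, meanlog = 1, sdlog = 0.8), nrow = 20))

# Flag variables whose Shapiro-Wilk p-value is below 0.05 (illustrative cutoff).
skewed <- vapply(X, function(v) shapiro.test(v)$p.value < 0.05, logical(1))

# Log-transform only the flagged columns (values must be strictly positive).
X_log <- X
X_log[skewed] <- lapply(X_log[skewed], log)

# Auto-scaling: center each column, then divide by its standard deviation.
X_auto <- scale(X_log, center = TRUE, scale = TRUE)

# Pareto scaling: center each column, then divide by the square root of its SD.
col_sd <- apply(X_log, 2, sd)
X_pareto <- scale(X_log, center = TRUE, scale = sqrt(col_sd))
```

After auto-scaling every column has standard deviation 1, whereas after Pareto scaling each column keeps a standard deviation equal to the square root of its original one, so high-variance variables still carry somewhat more weight.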

Log transformations are often used when your x variables are, for example, chemical concentrations, which are known to be lognormally distributed in the environment. In other applications (e.g. spectroscopy), a log transformation may not make the most sense. There are other transformation options you could consider, such as the rank-based inverse normal transformation, Box-Cox, or Tukey's ladder of powers; the rank-based inverse normal transformation in particular will almost always force an approximately normal result. What matters is that when you run PLSR, the relationship between the x-scores and y-scores is linear (t vs. u plots). If you see a lot of curvature in these plots, you could probably obtain better results by transforming something.
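One way to fit the model and draw a t vs. u plot is with the CRAN package `pls`, whose `plsr()` function centers the predictors by default and standardizes them when `scale = TRUE`. The sketch below is only an illustration: the data are simulated, and the number of components and the leave-one-out validation are arbitrary choices rather than recommendations from the answer.

```r
library(pls)  # CRAN package; run install.packages("pls") first if needed

# Simulated stand-in data (not from the question): 20 observations, 35 predictors.
set.seed(1)
X <- matrix(rnorm(20 * 35), nrow = 20)
y <- as.numeric(X[, 1:3] %*% c(1, -1, 0.5) + rnorm(20, sd = 0.3))
dat <- data.frame(y = y, X = I(X))  # I() keeps X as a single matrix column

# plsr() centers by default; scale = TRUE also divides each predictor by its SD.
fit <- plsr(y ~ X, ncomp = 3, data = dat, scale = TRUE, validation = "LOO")

# X-scores (t) against Y-scores (u) for the first component.
t1 <- scores(fit)[, 1]
u1 <- Yscores(fit)[, 1]
plot(t1, u1, xlab = "t1 (X-score)", ylab = "u1 (Y-score)")
```

A roughly straight cloud of points suggests the linear inner relation is adequate; pronounced curvature is the signal that a transformation of x or y may help.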