Standardizing Features for LDA – Effective Data Pre-processing Techniques

classification, data transformation, discriminant analysis, normalization, standardization

If a multi-class Linear Discriminant Analysis (or Multiple Discriminant Analysis, as I have also seen it called) is used for dimensionality reduction (or for transformation after dimensionality reduction via PCA), I understand that in general a "Z-score normalization" (or standardization) of the features won't be necessary, even if they are measured on completely different scales, correct? Since LDA contains a term similar to the Mahalanobis distance, which already implies normalized Euclidean distances?

So not only would it be unnecessary, but the results of an LDA on standardized and non-standardized features should be exactly the same!?

Best Answer

The credit for this answer goes to @ttnphns who explained everything in the comments above. Still, I would like to provide an extended answer.

To your question of whether the LDA results on standardized and non-standardized features are going to be exactly the same: the answer is yes. I will first give an informal argument, and then proceed with some math.

Imagine a 2D dataset shown as a scatter plot on one side of a balloon (original balloon picture taken from here): [Figure: LDA on a balloon]

Here the red dots are one class, the green dots are another class, and the black line is the LDA class boundary. Rescaling the $x$ or $y$ axis corresponds to stretching the balloon horizontally or vertically. It is intuitively clear that, even though the slope of the black line will change after such stretching, the classes will be exactly as separable as before, and the relative position of the black line will not change: each test observation will be assigned to the same class as before the stretching. So one can say that stretching does not influence the results of LDA.
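This intuition is easy to check numerically. Here is a small sketch (assuming scikit-learn's `LinearDiscriminantAnalysis` and `StandardScaler`; the toy data are my own) that fits LDA to the same two-class data with and without standardization and confirms that every point receives the same class label:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two classes, two features on wildly different scales
X = np.vstack([rng.normal([0, 0],   [1, 100], size=(50, 2)),
               rng.normal([2, 300], [1, 100], size=(50, 2))])
y = np.repeat([0, 1], 50)

X_std = StandardScaler().fit_transform(X)    # z-score each feature

pred_raw = LinearDiscriminantAnalysis().fit(X, y).predict(X)
pred_std = LinearDiscriminantAnalysis().fit(X_std, y).predict(X_std)
print(np.array_equal(pred_raw, pred_std))    # True: identical class assignments
```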


Now, mathematically, LDA finds a set of discriminant axes by computing eigenvectors of $\mathbf{W}^{-1} \mathbf{B}$, where $\mathbf{W}$ and $\mathbf{B}$ are within- and between-class scatter matrices. Equivalently, these are generalized eigenvectors of the generalized eigenvalue problem $\mathbf{B}\mathbf{v}=\lambda\mathbf{W}\mathbf{v}$.
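For concreteness, here is a minimal NumPy sketch of this eigenproblem; the function name `lda_eig` and the way the scatter matrices are accumulated are my own, but they follow the definitions above:

```python
import numpy as np

def lda_eig(X, y):
    """Eigenvalues and eigenvectors of W^{-1} B, sorted by decreasing eigenvalue."""
    d = X.shape[1]
    mean_total = X.mean(axis=0)
    W = np.zeros((d, d))   # within-class scatter
    B = np.zeros((d, d))   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)
        B += len(Xc) * np.outer(mc - mean_total, mc - mean_total)
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))   # eigenvectors of W^{-1} B
    order = np.argsort(vals.real)[::-1]
    return vals.real[order], vecs.real[:, order]
```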

Consider a centred data matrix $\mathbf{X}$ with variables in columns and data points in rows, so that the total scatter matrix is given by $\mathbf{T}=\mathbf{X}^\top\mathbf{X}$. Standardizing the data amounts to scaling each column of $\mathbf{X}$ by a certain number, i.e. replacing it with $\mathbf{X}_\mathrm{new}= \mathbf{X}\boldsymbol\Lambda$, where $\boldsymbol\Lambda$ is a diagonal matrix with scaling coefficients (inverses of the standard deviations of each column) on the diagonal. After such a rescaling, the scatter matrix will change as follows: $\mathbf{T}_\mathrm{new} = \boldsymbol\Lambda\mathbf{T}\boldsymbol\Lambda$, and the same transformation will happen with $\mathbf{W}_\mathrm{new}$ and $\mathbf{B}_\mathrm{new}$.
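A quick numerical sanity check of this transformation rule (arbitrary toy data; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)                     # centre, so that T = X^T X
Lam = np.diag(1.0 / X.std(axis=0))         # Lambda: inverse standard deviations

T = X.T @ X
T_new = (X @ Lam).T @ (X @ Lam)            # scatter matrix of the rescaled data
print(np.allclose(T_new, Lam @ T @ Lam))   # True
```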

Let $\mathbf{v}$ be an eigenvector of the original problem, i.e. $$\mathbf{B}\mathbf{v}=\lambda\mathbf{W}\mathbf{v}.$$ If we multiply this equation by $\boldsymbol\Lambda$ on the left and insert $\boldsymbol\Lambda\boldsymbol\Lambda^{-1}$ before $\mathbf{v}$ on both sides, we obtain $$\boldsymbol\Lambda\mathbf{B}\boldsymbol\Lambda\boldsymbol\Lambda^{-1}\mathbf{v}=\lambda\boldsymbol\Lambda\mathbf{W}\boldsymbol\Lambda\boldsymbol\Lambda^{-1}\mathbf{v},$$ i.e. $$\mathbf{B}_\mathrm{new}\boldsymbol\Lambda^{-1}\mathbf{v}=\lambda\mathbf{W}_\mathrm{new}\boldsymbol\Lambda^{-1}\mathbf{v},$$ which means that $\boldsymbol\Lambda^{-1}\mathbf{v}$ is an eigenvector after rescaling, with exactly the same eigenvalue $\lambda$ as before.
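Numerically, reusing the `lda_eig` sketch above on some two-class toy data of my own, the eigenvalues before and after rescaling coincide, and the leading eigenvector of the rescaled problem is indeed parallel to $\boldsymbol\Lambda^{-1}\mathbf{v}$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0, 0], [1, 50, 5], size=(40, 3)),
               rng.normal([1, 80, 2], [1, 50, 5], size=(40, 3))])
y = np.repeat([0, 1], 40)

Lam = np.diag(1.0 / X.std(axis=0))          # Lambda with 1/std on the diagonal
vals, vecs = lda_eig(X, y)                  # original problem
vals_new, vecs_new = lda_eig(X @ Lam, y)    # rescaled (standardized) problem

print(np.allclose(vals, vals_new))          # eigenvalues unchanged: True
v, v_new = vecs[:, 0], vecs_new[:, 0]
w = np.linalg.inv(Lam) @ v                  # Lambda^{-1} v
print(np.isclose(abs(w @ v_new) / np.linalg.norm(w), 1.0))  # same direction: True
```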

So the discriminant axis (given by the eigenvector) will change, but its eigenvalue, which shows how well the classes are separated, will stay exactly the same. Moreover, the projection onto this axis, which was originally given by $\mathbf{X}\mathbf{v}$, will now be given by $\mathbf{X}\boldsymbol\Lambda (\boldsymbol\Lambda^{-1}\mathbf{v})= \mathbf{X}\mathbf{v}$, i.e. it will also stay exactly the same (perhaps up to a scaling factor).
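Continuing the snippet above, the projections computed from the raw and the standardized data are numerically identical; using the unit-norm eigenvector returned by `lda_eig` instead of `w` would change the projection only by an overall scale factor:

```python
p_raw = X @ v                     # projection on the original discriminant axis
p_std = (X @ Lam) @ w             # = X Lambda Lambda^{-1} v = X v
print(np.allclose(p_raw, p_std))  # True
```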