R – Scaling Differences in Linear Discriminant Analysis Coefficients

classification, discriminant-analysis, r

This is based on the example in section 8.3.2 in Izenman's Modern Multivariate Statistical Analysis, using the Wisconsin breast cancer data. This is a binary classification problem, classifying data as either "malignant" or "benign."

Here is the R code I have been using, which generates the correct coefficients. The data are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data. Please note the formatting of the data in the code below – if you need clarification on this, let me know.

(If the link is down, try https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data.)

library(MASS)
library(dplyr)

# Read and format the file
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
                 stringsAsFactors = FALSE,
                 header = FALSE)
colnames(data) <- c(
  "ID", "DIAGNOSIS", 
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[, !(colnames(data) %in% c("ID", "DIAGNOSIS"))] <-
  apply(data %>% dplyr::select(-ID, -DIAGNOSIS), MARGIN = 2, FUN = log)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>% select(-ID)

lda(DIAGNOSIS ~., data = data)

The output is:

Call:
lda(DIAGNOSIS ~ ., data = data)

Prior probabilities of groups:
        B         M 
0.6274165 0.3725835 

Group means:
  radius.mv texture.mv  peri.mv  area.mv smooth.mv   comp.mv    scav.mv    ncav.mv   symt.mv  fracd.mv  radius.sd
B  2.485875   2.862456 4.345779 6.092424 -2.391093 -2.607534 -21.478838 -21.877138 -1.757494 -2.771984 -1.3278779
M  2.843529   3.057882 4.730647 6.819136 -2.281366 -1.998231  -1.944904  -2.510594 -1.655332 -2.776639 -0.6238094
  texture.sd   peri.sd  area.sd smooth.sd   comp.sd    scav.sd    ncav.sd   symt.sd  fracd.sd radius.ev texture.ev
B 0.09501073 0.6258005 2.973619 -5.012304 -4.062604 -22.059549 -22.733882 -3.934377 -5.795603  2.582422   3.131397
M 0.12148354 1.3337622 4.063965 -5.056042 -3.572136  -3.290033  -4.256156 -3.972701 -5.614785  3.031062   3.361176
   peri.ev  area.ev smooth.ev   comp.ev     scav.ev    ncav.ev   symt.ev  fracd.ev
B 4.453511 6.280822 -2.092602 -1.827802 -20.2208063 -20.786509 -1.320487 -2.546305
M 4.930661 7.179920 -1.943383 -1.083191  -0.8842608  -1.740325 -1.153316 -2.416048

Coefficients of linear discriminants:
                    LD1
radius.mv  -30.51574995
texture.mv  -0.37242169
peri.mv     29.52376437
area.mv      0.94255639
smooth.mv    0.03451320
comp.mv     -1.56939188
scav.mv      1.82685413
ncav.mv      0.25593131
symt.mv     -1.18699860
fracd.mv    -3.96712759
radius.sd   -2.50553731
texture.sd  -0.55183996
peri.sd      0.46892773
area.sd      3.10582705
smooth.sd    0.08061433
comp.sd     -0.14182425
scav.sd     -0.17481014
ncav.sd      0.53816835
symt.sd     -0.52520119
fracd.sd    -0.50005122
radius.ev    6.36294870
texture.ev   2.25899352
peri.ev     -3.11083922
area.ev     -2.08205924
smooth.ev    2.02715071
comp.ev      0.33358054
scav.ev     -1.32315770
ncav.ev     -1.11897286
symt.ev      2.88331787
fracd.ev     4.16723706

I am interested in replicating these coefficients using matrix multiplications and algebraic manipulations (i.e., "from scratch"). I assume these coefficients are correct (they are very close to what Izenman has in his text), hence this is more of a question about the implementation of the statistical theory than the R code.

Here is my attempt at the theoretical derivation of these coefficients: my understanding is that they are given by the matrix
$$\hat{\mathbf{b}} \equiv \hat{\boldsymbol\Sigma}^{-1}_{XX}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$$
where
$$\bar{\mathbf{x}}_i = n_i^{-1}\sum_{j=1}^{n_i}\mathbf{x}_{ij}$$
for $i = 1, 2$ is the vector of means ($\{\mathbf{x}_{ij}, j = 1, \dots, n_i\}$ are the samples placed in class $i$), and
$$\hat{\boldsymbol\Sigma}_{XX} = n^{-1}\mathbf{S}_{XX}$$
where
$$\mathbf{S}_{XX} = \mathbf{S}_{XX}^{(1)}+\mathbf{S}_{XX}^{(2)}$$
with
$$\mathbf{S}_{XX}^{(i)} = \sum_{j=1}^{n_i}(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)^{T}$$
for $i = 1, 2$.

data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()

mean_B = colMeans(data_B)
mean_M = colMeans(data_M)

Sigma_XX = (cov(data_B)*(nrow(data_B)-1) + cov(data_M)*(nrow(data_M)-1))/(nrow(data)-2)
solve(Sigma_XX) %*% (mean_M - mean_B)

With the output:

                   [,1]
radius.mv  -116.0451850
texture.mv   -1.4162439
peri.mv     112.2728658
area.mv       3.5843501
smooth.mv     0.1312467
comp.mv      -5.9680778
scav.mv       6.9471544
ncav.mv       0.9732547
symt.mv      -4.5139140
fracd.mv    -15.0861786
radius.sd    -9.5280483
texture.sd   -2.0985350
peri.sd       1.7832367
area.sd      11.8108280
smooth.sd     0.3065599
comp.sd      -0.5393287
scav.sd      -0.6647674
ncav.sd       2.0465447
symt.sd      -1.9972332
fracd.sd     -1.9015930
radius.ev    24.1969986
texture.ev    8.5904925
peri.ev     -11.8298883
area.ev      -7.9176475
smooth.ev     7.7088415
comp.ev       1.2685389
scav.ev      -5.0316994
ncav.ev      -4.2552260
symt.ev      10.9646709
fracd.ev     15.8471542

What am I doing wrong?

Note: I divided by $n - 2 = 567$ rather than $n = 569$ so that $\hat{\boldsymbol\Sigma}_{XX}$ is an unbiased estimator; this should not make much of a difference. My coefficients are off from the MASS lda() coefficients by a constant factor of around $3.8$, and it is not clear to me why.
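
Here is a quick sketch of how I see that constant factor (reusing data, mean_B, mean_M and Sigma_XX from the code above; fit$scaling is the component in which lda() stores the coefficients it prints):

fit <- MASS::lda(DIAGNOSIS ~ ., data = data)
b   <- solve(Sigma_XX) %*% (mean_M - mean_B)
# Element-wise ratio of my coefficients to the LD1 scaling: essentially constant
summary(drop(b) / fit$scaling[, "LD1"])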

Best Answer

You have correctly computed $$\mathbf {b} = \mathbf{S}_W^{-1}(\boldsymbol\mu_1-\boldsymbol\mu_2),$$ where $\boldsymbol\mu_i$ are the class means and $\mathbf{S}_W$ is the pooled within-class covariance matrix. This is indeed the LDA solution for two classes, as stated in Izenman's textbook you referenced, as well as in The Elements of Statistical Learning (Section 4.3, Eq. 4.11). The separating boundary between the two classes is orthogonal to the "LDA axis" $\mathbf{b}$; in other words, any point $\mathbf x$ can be projected onto this LDA axis as $\mathbf x\mathbf b^\top$ and then classified depending on the projection.
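
To make the projection step concrete, here is a minimal sketch (reusing data, mean_B, mean_M and Sigma_XX from your code; the nearest-projected-mean rule below assumes equal priors and is only an illustration, not exactly what lda() does):

b <- solve(Sigma_XX) %*% (mean_M - mean_B)          # LDA axis (any overall scaling works)
X <- as.matrix(data %>% dplyr::select(-DIAGNOSIS))  # predictors as a numeric matrix
proj     <- drop(X %*% b)                           # projection of every observation
centre_B <- drop(mean_B %*% b)                      # projected class means
centre_M <- drop(mean_M %*% b)
# Assign each point to the class whose projected mean is closer
pred <- ifelse(abs(proj - centre_M) < abs(proj - centre_B), "M", "B")
table(predicted = pred, actual = data$DIAGNOSIS)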

This equation for $\mathbf b$ is derived in Izenman as well as in The Elements by considering the likelihood ratio of two Gaussians. But there is another way to arrive at the same LDA axis: it can be introduced as the axis maximizing the ratio of between-class to within-class variance (a.k.a. the signal-to-noise ratio): $$\mathbf b = \operatorname*{argmax}_\mathbf{a} \frac{\mathbf{a}^\top\mathbf{S}_B\mathbf{a}}{\mathbf{a}^\top\mathbf{S}_W\mathbf{a}}.$$ Here $\mathbf{S}_B$ is the between-class covariance matrix, defined in exactly the same way as the within-class covariance matrix, except that instead of deviations of data points from their class means we take deviations of the class means from the grand mean.

The solution to maximizing this ratio is given by the first eigenvector of $\mathbf S_W^{-1}\mathbf S_B$, and one can easily show that it's the same $\mathbf b$ as in the first equation above.
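
As a numerical sanity check, here is a sketch (again reusing data_B, data_M, mean_B, mean_M and Sigma_XX from your code; the class-size weighting of $\mathbf S_B$ below is one common convention, and for two classes any scaling of $\mathbf S_B$ gives the same leading eigenvector):

n_B <- nrow(data_B); n_M <- nrow(data_M)
grand_mean <- colMeans(rbind(data_B, data_M))
S_B <- n_B * tcrossprod(mean_B - grand_mean) +       # between-class scatter
       n_M * tcrossprod(mean_M - grand_mean)
S_W <- Sigma_XX                                      # pooled within-class covariance
v1  <- Re(eigen(solve(S_W) %*% S_B)$vectors[, 1])    # leading eigenvector
b   <- drop(solve(S_W) %*% (mean_M - mean_B))
abs(cor(v1, b))                                      # equals 1: same direction up to scale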

Now comes the punch line! The signal-to-noise ratio does not depend on the length of $\mathbf{a}$, because the length enters both the numerator and the denominator. So when trying to find the maximum, one can fix the length as one pleases, and it turns out to be convenient to fix it such that $\mathbf{a}^\top\mathbf{S}_W\mathbf{a}=1$ (see The Elements, Section 4.3.3, Eqs. 4.15-4.16).

Getting back to your code, after you computed

b = solve(Sigma_XX) %*% (mean_M - mean_B)

you can check the value of $\mathbf{b}^\top\mathbf{S}_W\mathbf{b}$:

t(b) %*% Sigma_XX %*% b

and it is equal to 14.46126. Lo and behold, its square root is 3.802796, and normalizing by it, we arrive at the lda() output:

b_lda = b / drop(sqrt(t(b) %*% Sigma_XX %*% b))

Double-check:

> drop(t(b_lda) %*% Sigma_XX %*% b_lda)
[1] 1

Izenman writes in the text that his table lists coefficients as obtained by the first formula above. This is wrong. He must have computed the coefficients via some existing LDA implementation and did not think of this normalization issue.
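
For completeness, a quick confirmation sketch (assuming data and b_lda from above; fit$scaling is where MASS stores the coefficients that lda() prints, and in principle the sign of a discriminant axis is arbitrary, though here the two agree):

fit <- MASS::lda(DIAGNOSIS ~ ., data = data)
round(cbind(manual = drop(b_lda), MASS = fit$scaling[, "LD1"]), 6)
max(abs(drop(b_lda) - fit$scaling[, "LD1"]))   # essentially zero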