1. LDA or logistic regression?
LDA and logistic regression can both be used to 'predict' the class of a subject, and both can handle the case of more than two classes.
They differ in the way they solve the classification problem, and therefore they make different assumptions: logistic regression models the posterior class probability directly with the well-known S-shaped (logistic) function, while LDA assumes that in each class your data are (1) multivariate normal and (2) have the same var-covar matrix in every class.
If the assumptions of multivariate normality and a common var-covar matrix are fulfilled, then, in general, LDA will perform better.
2. Intuition behind LDA
You have 'subjects' that are characterized by features $x_1, x_2, \dots x_n$. The goal is to decide on the class of the subject, knowing the value of its features.
As said, LDA assumes that, in each class '$c$', your features have a multivariate normal distribution with a mean that depends on the class, $\mu_c$ (note that this is a vector), and a var-covar matrix $\Sigma$ that is the same for all classes. So for each class we know the multivariate normal density $\Phi_c(\mu_c,\Sigma)$, which allows us to calculate the probabilities.
Now, given the features, we can compute $\Phi_c$ at the observed features $x_i$ and we put the subject in the class $c$ where this yields the highest value (i.e. where the 'probability' is highest).
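To make the decision rule concrete, here is a minimal sketch in base R, with made-up means, covariance and feature values: it evaluates the multivariate normal density of a subject's features under each class and picks the class where that density is highest.

```r
# Multivariate normal density, written out from the usual formula
dens <- function(x, mu, Sigma) {
  p <- length(mu)
  d <- x - mu
  as.numeric(exp(-0.5 * t(d) %*% solve(Sigma) %*% d) /
               sqrt((2 * pi)^p * det(Sigma)))
}

mu1   <- c(0, 0); mu2 <- c(2, 1)       # class means (illustrative values)
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2)  # common var-covar matrix
x     <- c(1.5, 0.5)                   # features of the new subject

densities <- c(dens(x, mu1, Sigma), dens(x, mu2, Sigma))
which.max(densities)                   # class with the highest density
```

(Equal priors are implicitly assumed here; with unequal priors you would weight each density by the class prior, as in the derivation further below.)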
3. Why is LDA a dimension reduction technique?
LDA assumes multivariate normality in each class with the same var-covar. Therefore the classes are all the 'same' except for their mean. So you can 'feel' that the number of means will be important.
If you have $n$ features then, in the end, the solution will depend on the $C$ class means, $C$ being the number of classes. In fact it can be shown mathematically that LDA 'solves' the classification problem in a subspace of the $n$-dimensional feature space, and that this subspace has dimension at most $C - 1$.
To make it more concrete, assume that you have subjects with $25$ features and you want to classify them into two classes; then you can 'solve' the problem in a one-dimensional space (thus on a line). This is why LDA is said to be a dimension reduction technique: in this case it reduces the dimension from $25$ to one.
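As a small illustration of this (a sketch using `MASS::lda` on simulated data; all numbers and group labels are made up), two classes and $25$ features give a single discriminant direction, so every subject is reduced to one score:

```r
library(MASS)

set.seed(1)
n <- 100; p <- 25
x <- matrix(rnorm(2 * n * p), ncol = p)          # 200 subjects, 25 features
x[1:n, 1] <- x[1:n, 1] + 2                        # shift the mean of class "A"
grouping <- factor(rep(c("A", "B"), each = n))

fit <- lda(x, grouping)
dim(fit$scaling)          # 25 x 1: a single discriminant direction
scores <- predict(fit)$x  # one number (LD1) per subject: dimension 25 -> 1
```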
It can be shown that this decision rule is equivalent to assigning the observation to the class for which the discriminant function derived below is largest.
Deriving the discriminant function for LDA
For LDA we assume that the random variable $X$ is a vector $\mathbf{X} = (X_1,X_2,...,X_p)$ drawn from a multivariate Gaussian with a class-specific mean vector and a common covariance matrix $\Sigma$. In other words, the covariance matrix is common to all $K$ classes: $\mathrm{Cov}(X) = \Sigma$, a $p \times p$ matrix.
Since $x$ follows a multivariate Gaussian distribution, the class-conditional density $p(X = x | Y = k)$ is given by:
$$ f_k(x) = \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)$$
($\mu_k$ is the mean of inputs for category $k$)
Assume that we know the prior distribution exactly: $P(Y = k) = \pi_k$. Then we can find the posterior distribution using Bayes' theorem:
$$ p_k(x) = p(Y = k | X = x) = \frac{f_k(x) \ \pi_k}{P(X = x)} = C \times f_k(x) \ \pi_k $$
By the law of total probability the denominator is $P(X = x) = \sum_{l=1}^{K} \pi_l \, f_l(x)$, so a summation term remains; how do we go about eliminating it? Since this sum does not depend on $k$, and we are only interested in the terms that are functions of $k$ (see later), we can push it into a constant $C$.
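As a small numerical sketch (in R, with made-up density and prior values for $K = 3$ classes), the Bayes step is just "multiply each $f_k(x)$ by $\pi_k$ and normalise by their sum", and that sum is exactly $P(X = x)$:

```r
f   <- c(0.12, 0.05, 0.01)  # f_k(x) for the three classes (illustrative values)
pis <- c(0.5, 0.3, 0.2)     # prior probabilities pi_k

unnormalised <- f * pis
posterior <- unnormalised / sum(unnormalised)  # p_k(x); the denominator is P(X = x)
which.max(posterior)                           # class with the highest posterior
```

Dropping the common denominator does not change which class attains the maximum, which is why it can be absorbed into $C$.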
We will now proceed to expand and simplify the algebra, putting all constant terms into $C, C', C''$, etc.
$$
\begin{aligned}
p_k(x) &= p(Y = k | X = x) = \frac{f_k(x) \ \pi_k}{P(X = x)} = C \times f_k(x) \ \pi_k
\\
& = C \ \pi_k \ \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)
\\
& = C' \pi_k \ \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)
\end{aligned}
$$
Take the log of both sides:
$$
\begin{aligned}
\log p_k(x) &= \log \left( C' \pi_k \ \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right) \right)
\\
& = \log C' + \log \pi_k - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)
\end{aligned}
$$
Since the term $\log C'$ does not depend on $k$ and we aim to maximize the posterior probability over $k$, we can ignore it:
$$
\begin{aligned}
& \log \pi_k - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)
\\
&= \log \pi_k - \frac{1}{2} \left[ x^T \Sigma^{-1} x + \mu^T_k \Sigma^{-1} \mu_k \right] + x^T \Sigma^{-1} \mu_k
\\
&= C'' + \log \pi_k - \frac{1}{2} \mu^T_k \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k
\end{aligned}
$$
Absorbing the term $-\frac{1}{2} x^T \Sigma^{-1} x$, which does not depend on $k$, into $C''$, the objective function, sometimes called the linear discriminant function, is:
$$ \delta_k(x) = \log \pi_k - \frac{1}{2} \mu^T_k \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k $$
This means that, given an input $x$, we predict the class $k$ with the highest value of $\delta_k(x)$.
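As a minimal sketch (base R, with illustrative means, priors and a made-up covariance matrix), $\delta_k(x)$ can be evaluated for each class directly from the formula above, and the prediction is the argmax:

```r
# Linear discriminant score: delta_k(x) = log(pi_k) - 0.5 mu_k' S^-1 mu_k + x' S^-1 mu_k
delta <- function(x, mu_k, Sigma_inv, pi_k) {
  as.numeric(log(pi_k) - 0.5 * t(mu_k) %*% Sigma_inv %*% mu_k +
               t(x) %*% Sigma_inv %*% mu_k)
}

mus   <- list(c(0, 0), c(2, 1), c(-1, 2))  # class means (illustrative)
pis   <- c(0.5, 0.3, 0.2)                  # prior probabilities
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2)      # common covariance matrix
Sigma_inv <- solve(Sigma)
x <- c(1, 1)

scores <- mapply(function(mu, pk) delta(x, mu, Sigma_inv, pk), mus, pis)
which.max(scores)                          # predicted class k
```

In practice the unknown $\mu_k$, $\Sigma$ and $\pi_k$ are replaced by their sample estimates, which is what `MASS::lda` does.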
See here for an implementation in Python
Best Answer
There are several problems here.
- Each row should correspond to one case/individual; each column to one variable. If I understand your description correctly, that means you need to transpose your data.
  This also means that you have more variates than individuals, thus the variance-covariance matrix is not of full rank, which leads to problems during its inversion inside `lda`. You need to drastically reduce the number of variates or increase the number of individuals before performing LDA (if I correctly understood your description of the data).
- `MASS::lda` expects `grouping` to be a factor with one value per case (= row), not a matrix. That's why it is complaining that `length(grouping)` is not the same as `nrow(x)`.
- It does not make any sense to give the same data for `x` and `grouping`: `x` should be the matrix with the independent variates, `grouping` is the dependent variable.
- It is very unusual to give `x`, `grouping` and `data`. Either give `data` and `formula`: with that you call the formula interface (`lda.formula`). Or give `x` and `grouping`: that calls `lda.default` (a bit faster than the first option).

edit:
The formula version `lda(grouping ~ x)` is equivalent to `lda(x = x, grouping = grouping)`. If you have a data.frame `data` with columns `x` and `grouping`, then you'd use `lda(grouping ~ x, data = data)`. Note that a column of a data.frame can hold a whole matrix.
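Putting those points together, a sketch of a corrected call might look like this; `raw_data` and `class_labels` are hypothetical stand-ins for your actual objects, and the comments mark the assumptions:

```r
library(MASS)

# Hypothetical stand-ins for the real data: 10 variates measured on 60 cases,
# stored (as in the question) with variates in rows and cases in columns.
set.seed(42)
raw_data     <- matrix(rnorm(10 * 60), nrow = 10)
class_labels <- rep(c("healthy", "diseased"), each = 30)  # one label per case

x <- t(raw_data)                  # transpose: one row per case, one column per variate
grouping <- factor(class_labels)  # a factor of length nrow(x)

fit1 <- lda(x = x, grouping = grouping)   # calls lda.default

df   <- data.frame(grouping = grouping)
df$x <- x                                  # a data.frame column can hold a matrix
fit2 <- lda(grouping ~ x, data = df)       # calls lda.formula
```

Note that this only runs cleanly because there are more cases (60) than variates (10); with more variates than cases you would first have to reduce the number of variates, as described above.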