Solved – Linear discriminant analysis in R

discriminant-analysis, r

I have two matrices, both 46175 × 741 (rows are variables, columns are individuals/observations). Matrix A contains a categorical (perhaps dependent) variable (0/1/2 or NA), and Matrix B is continuous and independent (ranging from 0 to a couple of hundred).

I want to see if there is a relationship between these data. Firstly, am I correct in thinking that LDA in R is a valid way to test this? If so, how exactly do I run this?

z <- lda(data= MatrixA , x= MatrixB, grouping=MatrixB)

is the closest I've gotten, but it doesn't work. I get:

 z <- lda(data= MatrixA , x= MatrixB, grouping=MatrixB)
Error in lda.default(x, grouping, ...) : 
  nrow(x) and length(grouping) are different

Data Snippet:

 Matrix A -------------------------   
 SampleA    SampleB    SampleC          
   NA          0           1
   NA          NA          NA
   1           2           0
   0           0           0 

  Matrix B -----------------------
  SampleA    SampleB    SampleC          
   0          0           0
   83         124         56
   39         45          5
   12714      12477       8751 

The matrices contain data on the same individuals, with rows and columns in the same order. Matrix A contains genotypes (genetic information) that are either 0/1/2 or could not be obtained (NA). Matrix B is the number of reads aligned to that region. A zero in Matrix B is therefore not the same as a zero in Matrix A; it is closer in meaning to Matrix A's NA.

Best Answer

There are several problems here.

  • Each row should correspond to one case/individual; each column to one variable.
    If I understand your description correctly, that means you need to transpose your data.

  • This also means that you have far more variates (46175) than individuals (741), so the variance-covariance matrix is not of full rank, which causes problems when it is inverted inside lda.
    You need to drastically reduce the number of variates or increase the number of individuals before performing LDA (if I have understood your description of the data correctly).

  • MASS::lda expects grouping to be a factor with one value per case (= row), not a matrix.
    The length of a matrix is its total number of elements (rows × columns), which is why lda complains that length (grouping) is not the same as nrow (x).

  • It does not make any sense to give the same data for x and grouping: x should be the matrix with the independent variates, grouping is the dependent.

  • It is very unusual to give x, grouping and data.
    Either give data and formula: with that you call the formula interface (lda.formula).
    Or give x and grouping: that calls lda.default (a bit faster than the first option).
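
Putting these points together, here is a minimal sketch of a call that should run. It uses small simulated stand-ins for your matrices, not your real data: the sizes n_ind and n_var, the Poisson read counts, and the choice of the first row of MatrixA as the grouping are all assumptions made for illustration. How you reduce the 46175 variates to a manageable number is a separate question that this sketch does not answer.

```r
library(MASS)  # provides lda()

set.seed(1)
n_ind <- 60  # number of individuals (hypothetical)
n_var <- 5   # number of variates kept after reduction (hypothetical)

# Simulated stand-ins in the question's layout:
# rows = variables, columns = individuals
MatrixB <- matrix(rpois(n_var * n_ind, lambda = 50), nrow = n_var)
MatrixA <- matrix(sample(c(0, 1, 2, NA), n_var * n_ind, replace = TRUE,
                         prob = c(0.3, 0.3, 0.3, 0.1)),
                  nrow = n_var)

# Transpose so that each row is one individual and each column one variate
x <- t(MatrixB)
colnames(x) <- paste0("region", seq_len(n_var))

# One grouping value per individual: here the genotype at the first variable
# (an arbitrary choice for illustration)
grouping <- factor(MatrixA[1, ])

# Drop individuals whose genotype is missing, in both x and grouping
keep <- !is.na(grouping)

# Default interface: x = independent variates, grouping = dependent factor
z <- lda(x = x[keep, ], grouping = grouping[keep])
z$counts  # number of individuals per genotype class
```

Note that nrow (x) and length (grouping) now agree, and that there are far fewer variates than individuals.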


edit:

The formula version lda (grouping ~ x) is equivalent to lda (x = x, grouping = grouping). If you have a data.frame data with columns x and grouping, then you'd use lda (grouping ~ x, data = data). Note that a column of a data.frame can hold a whole matrix.
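
As a small sketch of that equivalence, on simulated data (the object names dat, z_formula and z_default, the group labels and the sizes are made up for illustration):

```r
library(MASS)  # provides lda()

set.seed(2)
x <- matrix(rnorm(90 * 3), ncol = 3)                  # 90 cases, 3 variates
grouping <- factor(rep(c("a", "b", "c"), each = 30))  # one value per case

# A data.frame whose column x holds the whole matrix
dat <- data.frame(grouping = grouping)
dat$x <- x

z_formula <- lda(grouping ~ x, data = dat)    # formula interface
z_default <- lda(x = x, grouping = grouping)  # default interface

# Compare the discriminant coefficients, ignoring row and column names
same_scaling <- isTRUE(all.equal(unname(z_formula$scaling),
                                 unname(z_default$scaling)))
```

Both calls should give the same discriminant coefficients, because the formula method builds the model matrix and then hands it to lda.default.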
