Solved – How Gower’s dissimilarity handle missing values in numeric columns

clusteringdistancegower-similaritymissing datar

I would like to ask a question about Gower dissimilarity, I was wondering how Gower measure handle missing values in numeric columns, especially that Gower standardized each column based on the range of the same attribute ?

I have read both details of functions daisy and gower.dist in R and their original source (chapter 1 of Kaufman and Rousseeuw (1990)) but I got confuse.
http://www.inside-r.org/packages/cran/StatMatch/docs/gower.dist
https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html

I tried also to look at similar discussions in this website
Gower distance and MDS: How to determine which variables count?
but I did not find an answer.

also imputing the data with a dummy/mean values are not an option for me, I need my data as it is. my data is students' exams marks.

Best Answer

It's your choice. There is no "correct" way.

The most "correct" way would be the work with two similarities. An upper bound and a lower bound.

Consider this toy example:

dist(  [A, B],  [C,?] )

if the missing value is D then you get a similarity of 0, that is your worst case. But if the missing value is B, and say you don't have any other records with a B and no A either, then it even could be the most similar object.

But then you would need algorithms that can handle this well, and I don't know of any.

A popular approach is missing value imputation. By replacing missing values (at least temporarily) with your best estimate, you are often closest to the real result.

Another popular approach is to ignore records with missing data.

Related Solutions

Solved – How does R handle missing values in lm

Edit: I misunderstood your question. There are two aspects:

a) na.omit and na.exclude both do casewise deletion with respect to both predictors and criterions. They only differ in that extractor functions like residuals() or fitted() will pad their output with NAs for the omitted cases with na.exclude, thus having an output of the same length as the input variables.

> N    <- 20                               # generate some data
> y1   <- rnorm(N, 175, 7)                 # criterion 1
> y2   <- rnorm(N,  30, 8)                 # criterion 2
> x    <- 0.5*y1 - 0.3*y2 + rnorm(N, 0, 3) # predictor
> y1[c(1, 3,  5)] <- NA                    # some NA values
> y2[c(7, 9, 11)] <- NA                    # some other NA values
> Y    <- cbind(y1, y2)                    # matrix for multivariate regression
> fitO <- lm(Y ~ x, na.action=na.omit)     # fit with na.omit
> dim(residuals(fitO))                     # use extractor function
[1] 14  2

> fitE <- lm(Y ~ x, na.action=na.exclude)  # fit with na.exclude
> dim(residuals(fitE))                     # use extractor function -> = N
[1] 20  2

> dim(fitE$residuals)                      # access residuals directly
[1] 14  2

b) The real issue is not with this difference between na.omit and na.exclude, you don't seem to want casewise deletion that takes criterion variables into account, which both do.

> X <- model.matrix(fitE)                  # design matrix
> dim(X)                                   # casewise deletion -> only 14 complete cases
[1] 14  2

The regression results depend on the matrices $X^{+} = (X' X)^{-1} X'$ (pseudoinverse of design matrix $X$, coefficients $\hat{\beta} = X^{+} Y$) and the hat matrix $H = X X^{+}$, fitted values $\hat{Y} = H Y$). If you don't want casewise deletion, you need a different design matrix $X$ for each column of $Y$, so there's no way around fitting separate regressions for each criterion. You can try to avoid the overhead of lm() by doing something along the lines of the following:

> Xf <- model.matrix(~ x)                    # full design matrix (all cases)
# function: manually calculate coefficients and fitted values for single criterion y
> getFit <- function(y) {
+     idx   <- !is.na(y)                     # throw away NAs
+     Xsvd  <- svd(Xf[idx , ])               # SVD decomposition of X
+     # get X+ but note: there might be better ways
+     Xplus <- tcrossprod(Xsvd$v %*% diag(Xsvd$d^(-2)) %*% t(Xsvd$v), Xf[idx, ])
+     list(coefs=(Xplus %*% y[idx]), yhat=(Xf[idx, ] %*% Xplus %*% y[idx]))
+ }

> res <- apply(Y, 2, getFit)    # get fits for each column of Y
> res$y1$coefs
                   [,1]
(Intercept) 113.9398761
x             0.7601234

> res$y2$coefs
                 [,1]
(Intercept) 91.580505
x           -0.805897

> coefficients(lm(y1 ~ x))      # compare with separate results from lm()
(Intercept)           x 
113.9398761   0.7601234 

> coefficients(lm(y2 ~ x))
(Intercept)           x 
  91.580505   -0.805897

Note that there might be numerically better ways to caculate $X^{+}$ and $H$, you could check a $QR$-decomposition instead. The SVD-approach is explained here on SE. I have not timed the above approach with big matrices $Y$ against actually using lm().

Solved – Handle missing values in factor variable

Option 1 is an alternative that must be considered, but there are other approaches, and combinations of approaches. Each column which possesses missing values must be treated individually.

The decision of how do deal with each column will depend on many factors: the meaning of the column, proportion of missing values, nature of missing values (if it's a categorical variable, a missing value can be even very informative to predict the response variable), etc. There is no "default" treatment. We need specific information to give specific advise.

You should deal with it as systematically as possible:

List all columns which have missing values.
Determine the proportion of missing values in each column.
Choose standard candidate approaches for each column (list-wise deletion, mean imputation, regression imputation, etc.).
Evaluate the best approaches (you could for example train your classifier with two different approaches and evaluate them in a validation set).

There are lots of advanced approaches. Everything said above apply to them anyway. Googling "missing data" will give you many more insights.

[Edit: comment about "option 2" removed, because the original question was modified and the comment is not applicable anymore]

Best Answer

Related Solutions

Solved – How does R handle missing values in lm

Solved – Handle missing values in factor variable

Related Question