It's your choice. There is no "correct" way.
The most "correct" way would be the work with two similarities. An upper bound and a lower bound.
Consider this toy example:
dist( [A, B], [C,?] )
If the missing value is D, then you get a similarity of 0; that is your worst case. But if the missing value is B, and say you have no other records containing a B and none containing an A either, then it could even be the most similar object.
But then you would need algorithms that can handle this well, and I don't know of any.
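For illustration only, here is a minimal sketch of how such bounds could be computed, assuming a Jaccard set similarity and a small, known alphabet of candidate values (both the similarity choice and the helper names are my own, not a standard library API):

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def similarity_bounds(x, y, alphabet):
    # Substitute every candidate value for the single missing entry (None)
    # and record the worst- and best-case similarity.
    scores = [jaccard(x, [v if e is None else e for e in y]) for v in alphabet]
    return min(scores), max(scores)

# The toy example above: dist([A, B], [C, ?]) with possible values A..D
lo, hi = similarity_bounds(["A", "B"], ["C", None], alphabet="ABCD")
# lo == 0.0  (missing value is C or D: no overlap)
# hi == 1/3  (missing value is A or B: one shared element)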
A popular approach is missing value imputation. By replacing missing values (at least temporarily) with your best estimate, you are often closest to the real result.
Another popular approach is to ignore records with missing data.
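A rough sketch of both approaches on a plain numeric matrix, using NumPy (mean imputation is just one simple choice of "best estimate"; the toy data are my own):

import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

# Imputation: replace each missing value with its column mean.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

# Ignoring records: keep only the rows without any missing value.
X_complete = X[~np.isnan(X).any(axis=1)]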
You might experience some issues with vanilla PCA on CLR coordinates. There are two major problems with compositional data:
- they are non-negative (in fact strictly positive wherever a log is taken)
- they have a sum constraint
Various compositional transforms address one or both of these issues. In particular, CLR transforms your data by taking the log of the ratio between observed frequencies ${\bf x}$ and their geometric mean $G({\bf x})$, i.e.
$$
\hat{\bf{x}} = \left \{ \log \left (\frac{{x}_{1}}{G({\bf x})} \right), \dots, \log \left (\frac{{x}_{n}}{G({\bf x})} \right) \right \} =
\left\{ \log ({x}_{1}) - \log( G({\bf x}) ) , \dots ,\log({x}_{n}) - \log( G({\bf x})) \right\}
$$
Now, consider that
$$
\log (G({\bf x} ))=\log \left( \exp \left[ \frac { 1 }{ n } \sum _{ i=1 }^{ n }{ \log ({ x }_{ i }) } \right] \right) = \mathop{\mathbb{E}}\left[ \log({\bf x}) \right]
$$
This effectively means that
$$
\sum{\hat{{\bf x}}} = \sum{ \left [ \log({\bf x}) - \mathop{\mathbb{E}}\left[ \log({\bf x}) \right] \right ]} = 0
$$
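A quick numeric check: for ${\bf x} = (1, 2, 4)$ the geometric mean is $G({\bf x}) = \sqrt[3]{1 \cdot 2 \cdot 4} = 2$, so $\hat{\bf x} = \left( \log\frac{1}{2}, \log\frac{2}{2}, \log\frac{4}{2} \right) = (-\log 2,\ 0,\ \log 2)$, which sums to zero as expected.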
In other words, CLR removes the value-range restriction (which is good for some applications), but does not remove the sum constraint. The result is a singular covariance matrix, which effectively breaks (M)ANOVA, linear regression and similar methods, and makes PCA sensitive to outliers (because robust covariance estimation requires a full-rank matrix). As far as I know, of all compositional transforms only ILR addresses both issues without any major underlying assumptions. The situation is a bit more complicated, though: the SVD of CLR coordinates gives you an orthogonal basis in the ILR space (ILR coordinates span a hyperplane in CLR space), so your variance estimates will not differ between ILR and CLR. That is of course obvious, because both ILR and CLR are isometries on the simplex. There are, however, methods for robust covariance estimation on ILR coordinates [2].
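To make the singularity concrete, here is a small sketch in the same NumPy/SciPy style as the session below; the lognormal toy data and the use of scipy.linalg.helmert to build an orthonormal ILR basis are my own choices, not the only valid ones:

import numpy as np
from scipy.linalg import helmert
from scipy.stats import gmean

def clr(x):
    return np.log(x) - np.log(gmean(x, axis=0))

# 4 strictly positive components, 1000 samples
x = np.random.lognormal(mean=3, sigma=0.5, size=(4, 1000))
z = clr(x)

np.linalg.matrix_rank(np.cov(z))  # 3, not 4: the CLR covariance is singular

# ILR: project onto an orthonormal basis of the zero-sum hyperplane
V = helmert(4)                    # (3, 4), rows orthogonal to (1, 1, 1, 1)
y = V @ z
np.linalg.matrix_rank(np.cov(y))  # 3: full rank in the ILR space

The missing dimension is exactly the direction of the sum constraint; ILR simply drops it.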
Update I
Just to illustrate that CLR is not valid for correlation- and location-dependent methods, let's assume we sample a community of three independent, normally distributed components 100 times. For the sake of simplicity, let all components have equal expectations (100) and variances (100):
In [1]: import numpy as np
In [2]: from scipy.stats import linregress
In [3]: from scipy.stats.mstats import gmean
In [4]: def clr(x):
...: return np.log(x) - np.log(gmean(x))
...:
In [5]: nsamples = 100
In [6]: samples = np.random.multivariate_normal(
...: mean=[100]*3, cov=np.eye(3)*100, size=nsamples
...: ).T
In [7]: transformed = clr(samples)
In [8]: np.corrcoef(transformed)
Out[8]:
array([[ 1. , -0.59365113, -0.49087714],
[-0.59365113, 1. , -0.40968767],
[-0.49087714, -0.40968767, 1. ]])
In [9]: linregress(transformed[0], transformed[1])
Out[9]: LinregressResult(slope=-0.5670, intercept=-0.0027, rvalue=-0.5936, pvalue=7.5398e-11, stderr=0.0776)
Even though the three components were sampled independently, their CLR coordinates come out strongly negatively correlated: every coordinate shares the same $-\log(G({\bf x}))$ term, which for $n$ roughly i.i.d. components induces spurious correlations of about $-1/(n-1)$ (here $-0.5$).
Update II
Considering the responses I've received, I find it necessary to point out that at no point in my answer have I said that PCA doesn't work on CLR-transformed data. I've stated that CLR can break PCA in subtle ways, which might not matter for dimensionality reduction, but does matter for exploratory data analysis. The paper cited by @Archie covers microbial ecology. In that field of computational biology, PCA or PCoA on various distance matrices is used to explore sources of variation in the data, and my answer should only be considered in this context. Moreover, this is highlighted in the paper itself:
... The compositional biplot [note: referring to PCA] has several
advantages over the principal co-ordinate (PCoA) plots for β-diversity
analysis. The results obtained are very stable when the data are
subset (Bian et al., 2017), meaning that exploratory analysis is not
driven simply by the presence absence relationships in the data nor by
excessive sparsity (Wong et al., 2016; Morton et al., 2017).
Gloor et al., 2017
Update III
Additional references to published research (I thank @Nick Cox for the recommendation to add more references):
- Arguments against using CLR for PCA
- Arguments against using CLR for correlation-based methods
- Introduction to ILR
Best Answer
The only benefit of na.exclude over na.omit is that the former will retain the original number of rows in the data. This may be useful where you need to retain the original size of the dataset, for example when you want to compare predicted values to the original values. With na.omit you will end up with fewer rows, so you won't as easily be able to compare.
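A small R sketch of the difference, on toy data of my own (only the vector lengths matter here):

df <- data.frame(x = c(1, 2, NA, 4, 5), y = c(2, NA, 6, 8, 11))

fit_excl <- lm(y ~ x, data = df, na.action = na.exclude)
length(fitted(fit_excl))  # 5: padded with NA, aligned with nrow(df)

fit_omit <- lm(y ~ x, data = df, na.action = na.omit)
length(fitted(fit_omit))  # 3: incomplete rows are silently dropped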