Solved – How to deal with datasets that have many values out of range / over threshold

Tags: censoring, r, regression, threshold

I have a dataset of genomic information which I'm going to be comparing with various biochemical markers. Unfortunately a lot of the biochemical markers have limited ranges in their assays, so I have a lot of data that looks like "40", ">45", "35", ">45" for tests that have a threshold at 45 (for example).
My intended analysis for most of this data is linear regression in R. So what is the statistically correct way to deal with this data?

  1. Ignore it, let R cast the values with ">" to NA and potentially lose information about important associations

  2. Make the over threshold values equal to the threshold. This has similar problems to 1)

  3. It depends. Sigh. Could you please give me some pointers as to what other considerations I should be thinking about or information you might need to answer my question?
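(For concreteness, here is a minimal sketch of how such strings could be split into a numeric value plus a censoring flag in R; the raw vector is made up to match the example above:)

```r
# Hypothetical assay strings like those described in the question
raw <- c("40", ">45", "35", ">45")

over  <- grepl("^>", raw)                 # TRUE where the assay hit its ceiling
value <- as.numeric(sub("^>", "", raw))   # numeric part; equals the threshold for censored rows

data.frame(raw, value, over)
```

Keeping the flag alongside the capped value at least preserves the information that option 1 would throw away.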

Edit: Based on the comments I've given more information about my datasets. The values which are out of range (GFR and Fol) are independent variables which I'll use in linear regression like so:

lm(H~allele+Age+Sex+as.double(GFR)+as.double(Fol))

GFR looks like:

summary(as.double(GFR)) 
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
31.00   70.00   77.00   75.66   83.00  100.00  105.00

and appears to be normally distributed:

V = qqnorm(na.omit(as.double(GFR)))
cor(V$x, V$y)
[1] 0.9911351

There are 105 values coded as ">90" (not sure why the summary said Max is 100) out of 434.

Fol is distributed like so:

summary(as.double(Fol))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
6.10   23.20   29.80   29.14   35.70   45.30    8.00

and also appears to be normally distributed:

V  = qqnorm(na.omit(as.double(Fol)))   
cor(V$x, V$y)
[1] 0.9911351

There are 8 out of 434 values in Fol coded as ">45.3". I took my cue for calling these normally distributed from this assessment-of-normality guide.

I also have another variable, CRP, which is a dependent variable on which I'd like to run a similar linear regression. CRP has 11 of 434 values coded as "<0.2". Its distribution is:

summary(as.double(CRP))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
0.200   0.600   1.300   2.674   2.650 112.400  11.000

The graphed data are clearly not normal, with a qqnorm correlation of 0.5153663. The value of 112 is a clear outlier.

I hope that makes it more clear. Please let me know if you need more information. Thanks for your help.

Best Answer

The most correct way to handle this is to model the probability of overshooting the threshold separately. (Note: treating the overshoots as NA would put them in the same position as genuinely missing data, which is also very common with biomarkers but needs a whole different treatment. Either way, this kind of 'missing data' is, if I may coin the term, 'missing completely not at random'.) This is not an easy undertaking. A colleague of mine is working on this, and has already shown that the correct analysis can produce different results.
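As a rough sketch of what modelling the overshoots separately could look like (simulated data and made-up names; a real analysis would need a more elaborate joint model than these two independent fits):

```r
set.seed(1)
n <- 434
x <- rnorm(n)
y_true <- 70 + 5 * x + rnorm(n, sd = 8)

over  <- y_true > 90            # assay ceiling at 90, as with the GFR example
y_obs <- ifelse(over, 90, y_true)

# Part 1: model the probability of exceeding the threshold
fit_p <- glm(over ~ x, family = binomial)

# Part 2: ordinary linear model on the uncensored observations only
fit_l <- lm(y_obs ~ x, subset = !over)

summary(fit_p)
summary(fit_l)
```

Note that Part 2 on its own is biased (it conditions on not being censored); the point of the sketch is only to show the two components such an approach separates.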

Apart from that: if you are not aiming high in statistical correctness, I fear that the 'accepted standard' in many fields for this kind of situation is indeed one of your first two options. I may not like it, but there are worse 'accepted practices' around. Check the literature in your field of interest to see what others do, or choose to do the hard work of fitting more elaborate models.
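For the case where the censored variable is the response (your CRP, left-censored at 0.2), one standard off-the-shelf option is Tobit-style censored regression via survreg() in the survival package. A hedged sketch with simulated stand-in data and made-up predictors (in Surv() with type = "left", the event indicator is 1 for observed values and 0 for censored ones):

```r
library(survival)

# Simulated stand-in for a left-censored response such as CRP
set.seed(2)
n <- 434
Age <- rnorm(n, 50, 10)
CRP_true <- exp(rnorm(n) + 0.01 * Age)   # skewed, like the real CRP
observed <- CRP_true >= 0.2              # FALSE = below the detection limit
CRP <- pmax(CRP_true, 0.2)               # values below 0.2 are reported as the limit

# Gaussian survreg on log(CRP) is the classic Tobit formulation
fit <- survreg(Surv(log(CRP), observed, type = "left") ~ Age,
               dist = "gaussian")
summary(fit)
```

Working on the log scale also sidesteps the non-normality and the outlier you describe, since the model assumes Gaussian errors for the (latent) response.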
