Solved – Sampling machine learning output to calculate confidence intervals

confidence interval, machine learning

I need to classify a large number of short-answer, free-response items from a study with a between-group design.

To reduce manual labor costs, I was thinking of manually coding a small sample, running the rest of the responses through an SVM classifier, and then manually coding a random sample of the SVM classifier's output to obtain classical statistical measures for the automatically coded data set.

The original, and overly verbose, title to this question was, "Is applying random sampling to output from a machine learning classifier a statistically valid way to calculate confidence intervals?"

I have already done a conceptual sanity check with a friend of mine who worked with machine-learning algorithms and atmospheric modeling, but I wanted to run it past some real statisticians before I start basing my workflow around this.

Thanks!

Best Answer

In my opinion, you are hinting at the bootstrap method of deriving confidence intervals, which is sound. Wikipedia says:

Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution of the observed data.

Here, the classifier output serves as the approximating distribution of the data, so randomly sampling from it is conceptually sound. See:

http://en.wikipedia.org/wiki/Bootstrapping_(statistics)
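As a rough illustration of the idea, here is a minimal R sketch of a percentile bootstrap for the agreement rate between the SVM labels and a hand-coded audit sample. The `agree` vector is simulated and purely hypothetical; in practice it would hold 1 where the hand code matched the SVM label and 0 where it did not. The sketch assumes the audit sample was drawn at random from the classifier output, and the exact binomial interval from `binom.test` is shown only for comparison.

```r
# Hypothetical audit sample: 1 = SVM label matched the hand-coded label, 0 = mismatch.
# Simulated here as a stand-in for real audit data.
set.seed(42)
agree <- rbinom(200, size = 1, prob = 0.85)

# Percentile bootstrap: resample the audit sample with replacement and
# recompute the agreement rate for each resample.
boot_ci <- function(x, n_boot = 2000, level = 0.95) {
  stats <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

ci_boot <- boot_ci(agree)

# Exact (Clopper-Pearson) binomial interval, for comparison.
ci_exact <- binom.test(sum(agree), length(agree))$conf.int

print(mean(agree))
print(ci_boot)
print(ci_exact)
```

With a sample of this size the two intervals should be close; the interval describes the classifier's agreement with manual coding, not the between-group effect itself.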

Related Solutions

Solved – Calculate odds ratio confidence intervals from plink output

For the calculation of confidence intervals you'll need standard errors for the effects, but those are not available in the output. However, the standard errors can be estimated from the Wald statistics and odds ratios.

The calculation goes as follows:

1. Take the natural logarithm of the odds ratio. This gives you the beta from the logistic model. For example, for the first row of your table: beta = ln(4.23) = 1.442.
2. The standard error of the beta is the absolute value of the beta divided by the square root of the Wald statistic (STAT). This works when STAT is a 1-df Wald chi-square, that is (beta/se)^2, which is what this calculation assumes. Again, for the first row of your table: se = 1.442/sqrt(61.5) = 0.184.
3. The 95% confidence interval for the beta is then beta +/- 1.96*se. The constant 1.96 comes from the normal distribution. Again, for the first row of data (computed without intermediate rounding): beta - 1.96*se ... beta + 1.96*se = 1.082 ... 1.803.
4. Last, convert the confidence interval of the beta to the confidence interval of the odds ratio by exponentiating its limits. For the first row of data: exp(lower) = 2.950 and exp(upper) = 6.066.

So, your odds ratio for the first row of the table is 4.23 and its 95% confidence interval is 2.950-6.066. Because the confidence interval does not include one, the result is statistically significant. The results are subject to error due to rounding of the output from PLINK.

This calculation can be achieved in, e.g., Excel, but below is also an R function that does the same thing (just in case you also use R).

# The data
or<-structure(list(SNP1 = structure(c(1L, 1L, 1L, 1L), .Label = "rs1", class = "factor"), 
SNP2 = structure(c(1L, 1L, 1L, 1L), .Label = "rs2", class = "factor"), 
HAPLOTYPE = c(22L, 12L, 21L, 11L), F = c(0.00992, 0.038, 
0.00015, 0.952), OR = c(4.23, 1.02, 5.22e-10, 0.762), STAT = c(61.5, 
0.217, 453, 22.9), P = c(4.43e-15, 0.642, 1.77e-100, 1.73e-06
)), .Names = c("SNP1", "SNP2", "HAPLOTYPE", "F", "OR", "STAT", 
"P"), class = "data.frame", row.names = c(NA, -4L))

# The function
orci<-function(or) {
   or$beta<-log(or$OR)
   or$se<-abs(or$beta/sqrt(or$STAT))
   or$lower<-or$beta-1.96*or$se
   or$upper<-or$beta+1.96*or$se
   or$LOWER<-exp(or$lower)
   or$UPPER<-exp(or$upper)
   or$res<-paste(or$OR, " (", round(or$LOWER, 3), "-", round(or$UPPER, 3), ")", sep="")
   return(or)
}

# The calculation
orci(or)

# The result
#SNP1 SNP2 HAPLOTYPE       F       OR    STAT         P         beta         se        lower       upper        LOWER        UPPER                  res
#1  rs1  rs2        22 0.00992 4.23e+00  61.500  4.43e-15   1.44220199 0.18390288   1.08175235   1.8026516 2.949844e+00 6.065710e+00    4.23 (2.95-6.066)
#2  rs1  rs2        12 0.03800 1.02e+00   0.217  6.42e-01   0.01980263 0.04251018  -0.06351733   0.1031226 9.384579e-01 1.108627e+00   1.02 (0.938-1.109)
#3  rs1  rs2        21 0.00015 5.22e-10 453.000 1.77e-100 -21.37335353 1.00420775 -23.34160072 -19.4051063 7.292419e-11 3.736538e-09 0.000000000522 (0-0)
#4  rs1  rs2        11 0.95200 7.62e-01  22.900  1.73e-06  -0.27180872 0.05679965  -0.38313603  -0.1604814 6.817202e-01 8.517337e-01  0.762 (0.682-0.852)

Solved – Prediction intervals for machine learning algorithms

To me it seems as good an approach as any to quantify the uncertainties in the predictions. Just make sure to repeat all modeling steps (for a GBM that would be the parameter tuning) from scratch in every bootstrap resample. It could also be worthwhile to bootstrap the importance rankings to quantify the uncertainty in the rankings.
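The resampling loop described above might look roughly like the following R sketch. Everything in it is an assumption made for illustration: the data are simulated, and `fit_pipeline` is a hypothetical placeholder in which `lm()` stands in for the GBM. With a real GBM, the parameter tuning would have to happen inside `fit_pipeline` so that it is redone in every resample.

```r
set.seed(1)

# Simulated training data (hypothetical).
n <- 200
train <- data.frame(x = runif(n, 0, 10))
train$y <- 2 + 0.5 * train$x + rnorm(n)
new_obs <- data.frame(x = c(2, 5, 8))

# Placeholder for the full modeling pipeline; lm() stands in for the GBM.
# Any tuning step belongs inside this function.
fit_pipeline <- function(d) lm(y ~ x, data = d)

# Refit the whole pipeline on each bootstrap resample and predict the new points.
n_boot <- 500
preds <- replicate(n_boot, {
  idx <- sample(nrow(train), replace = TRUE)
  predict(fit_pipeline(train[idx, ]), newdata = new_obs)
})

# preds has one row per new observation and one column per resample.
interval <- t(apply(preds, 1, quantile, probs = c(0.025, 0.975)))
point <- predict(fit_pipeline(train), newdata = new_obs)
print(cbind(point, interval))
```

Note that percentile intervals built this way reflect only how much the fitted prediction moves between resamples; they do not add the noise of an individual new outcome.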

I have found that sometimes the intervals do not contain the actual prediction, especially when estimating a probability. Increasing the minimum number of observations in each terminal node usually solves that, at least in the data that I have worked with.

Conformal prediction seems like a useful approach for quantifying the confidence in predictions on new data. I have only scratched the surface thus far and others are probably more suited to give an opinion on that.
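For readers who have not seen it, one common variant is split conformal prediction for regression, sketched below in R. This is only an illustration under stated assumptions, not something taken from the answer: the data are simulated, `lm()` stands in for whatever model is actually used, and absolute residuals are used as the nonconformity score.

```r
set.seed(2)

# Simulated data (hypothetical).
n <- 400
d <- data.frame(x = runif(n, 0, 10))
d$y <- 2 + 0.5 * d$x + rnorm(n)

# Split into a proper training set and a calibration set.
idx_train <- sample(n, n / 2)
train <- d[idx_train, ]
calib <- d[-idx_train, ]

# Fit on the training half only; lm() stands in for any regression model.
fit <- lm(y ~ x, data = train)

# Nonconformity scores: absolute residuals on the calibration half.
scores <- abs(calib$y - predict(fit, newdata = calib))

# Finite-sample-adjusted quantile of the scores for 90% target coverage.
alpha <- 0.10
n_cal <- length(scores)
k <- ceiling((n_cal + 1) * (1 - alpha))
q_hat <- if (k > n_cal) Inf else sort(scores)[k]

# Interval for new points: prediction plus or minus the calibrated quantile.
new_obs <- data.frame(x = c(2, 5, 8))
pred <- predict(fit, newdata = new_obs)
interval <- cbind(lower = pred - q_hat, fit = pred, upper = pred + q_hat)
print(interval)
```

The intervals have the same width everywhere, which is a known limitation of this simple score; the coverage guarantee is marginal and relies on the calibration and new data being exchangeable.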

There is some crude R-code in my reply to this post about finding a GBM prediction Interval.

Hope this helps!
