Solved – Using bootstrap to estimate the 95th percentile and confidence interval for skewed data

bootstrapconfidence intervalquantilessamplingskewness

The problem:

I have data of sales per day during a certain period (n=7939). The data is rather skewed (see the first image below). I would like to propose the number of items to resupply every day such that for 95% of the days there is enough stock. It is given that the stock is resupplied every day and a big surplus of stock is considered waste. We can disregard seasonal influences.

Picture of the data:
enter image description here

Approach:
Since I want to consider the number of items that would satisfiy 95% of the days, I simply took the 95th percentile of the sales. However, taking this percentile gives us no idea of the confidence interval. I felt there must be a more solid approach.

To resolve this, I applied bootstrap resampling and computed the 95th percentile of each sample. I performed this operation a thousand times. The idea is to make use of the central limit theorem. With normally distributed sample percentiles I can provide the mean and the 0.025 and 0.975 percentile of the sample percentiles as confidence interval.

However, the problem is that my data is very "binned" and I don't know where to go from here.

Basically I have two questions:

  • Given my problem, is this a valid approach? Is there a different approach that might be more suitable for this problem?
  • Is the binned distribution a problem for assuming normality (and thus provide the confidence around the mean 95th percentile).

enter image description here

Best Answer

Maybe you don't have to adopt normal distribution. Why don't you just use the 2.5% percentile and 97.5% percentile of boot strap sample percentiles as the confidence interval? I simulated usual bootstrap method and it seems work when comparing to the method using binomial distribution.

I don't have your data so I made some data from gamma distribution which is skewed.

#making data
set.seed(1)
x<-rgamma(7000,5,0.3)
hist(x)

x_sorted<-sort(x)

x_sorted[round(7000*0.95)] # estimate of 95% x

This is the bootstrap code I ran.

#method1 bootstrap
bootx95p(x_sorted,1000,0.05)

bootx95p<-function(x,b,alpha){
# x is your data
# b is the number of bootstraps.
# alpha is your type I error

n<-length(x)
p<-round(n*0.95)
xp<-rep(0,b)

for(i in 1:b){
x_boot<-sample(x,n,replace=TRUE)
x_boot<-sort(x_boot)
xp[i]<-x_boot[p]
}

xp<-sort(xp)
a<-round(alpha/2*(b+1))
CIup<-xp[b-a+1]
CIlow<-xp[a]

cat(' CI (',CIlow,', ',CIup,')','\n')
hist(xp)

}

The estimate would be 30.56664 and this is the result of the bootstrap method : CI ( 30.0623 , 31.08694 ) The below is the histogram of the distribution of 95th percentile of sample percentiles acquired from bootstrap method. enter image description here

And this is the method you also suggested using binomial distribution.

 #2 using bionomial
    up=qbinom(0.975,7000,0.95)
    low=qbinom(0.025,7000,0.95)
    x_sorted[up]
    x_sorted[low]

The result is quite similar :

> x_sorted[up]
[1] 31.08901
> x_sorted[low]
[1] 30.04189

As someone may have noticed from my English, I am not a native English speaker and even learning English. So It would be appreciated if someone correct my grammar.