Small Sample Size Estimation – Understanding Parameters’ Uncertainty for Small Sample Size

Tags: confidence interval, estimation, goodness of fit, lognormal distribution, small-sample

Suppose we have a small set of numbers (5 to 10 observations), and we’re trying to fit a distribution to this set. We also know that all the numbers are positive. I tried to fit a lognormal distribution, but I’m not sure how good my estimates are since the sample is very small; I’m also not sure whether looking at a goodness-of-fit test is enough with so few observations.

Any suggestions on how to tackle this issue (i.e., how to be more confident about my estimates)?

Best Answer

I would not recommend using a goodness-of-fit test for such a small sample. For example, if you simulate $5$ to $10$ observations from a log-normal distribution and apply the Shapiro-Wilk normality test to them, the p-value exceeds $0.05$ in more than $30\%$ of the samples, so the test fails to reject normality even though the data are not normal. In other words, the test has low power at this sample size. See the following R code.

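# count[i] = 1 if normality is NOT rejected for the i-th simulated sample
count = rep(0,10000)

for(i in 1:10000){
x = exp(rnorm(10))  # log-normal sample of size 10
if(shapiro.test(x)$p.value>0.05) count[i] = 1 
}

# Proportion of samples for which the test fails to reject normality
mean(count)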
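The same experiment can be repeated for each sample size from 5 to 10. The following is only a sketch: the helper name `non_rejection_rate`, the seed, and the number of replications are illustrative choices, not part of the code above.

```r
# Sketch: estimated share of log-normal samples for which Shapiro-Wilk
# does NOT reject normality at the 5% level, for n = 5, ..., 10.
set.seed(123)      # arbitrary seed, only for reproducibility
n_sim <- 10000     # replications per sample size (illustrative choice)

# Fraction of simulated log-normal samples of size n with p-value > 0.05
non_rejection_rate <- function(n, n_sim) {
  p_values <- replicate(n_sim, shapiro.test(exp(rnorm(n)))$p.value)
  mean(p_values > 0.05)
}

sizes <- 5:10
rates <- sapply(sizes, non_rejection_rate, n_sim = n_sim)
names(rates) <- paste0("n=", sizes)
print(round(rates, 3))
```

The smaller the sample, the more often the test fails to flag the non-normality, which is the reason for not relying on it here.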

You might consider Maximum Likelihood Estimation (MLE) and quantifying the accuracy of the estimation by constructing confidence-likelihood intervals for the parameters. One option consists of using the profile likelihood of the parameters $(\mu,\sigma)$.

In this case, the MLEs of $(\mu,\sigma)$ for a sample $(x_1,...,x_n)$ are given by

$$\hat\mu= \dfrac{1}{n}\sum_{j=1}^n\log(x_j);\,\,\,\hat\sigma^2=\dfrac{1}{n}\sum_{j=1}^n(\log(x_j)-\hat\mu)^2.$$
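As a sketch of where the profile likelihoods come from (standard normal-theory algebra applied to $\log(x_j)$; the notation $R_p$ is introduced here only for illustration): maximising the likelihood over $\sigma$ for fixed $\mu$, and over $\mu$ for fixed $\sigma$, and then dividing by the likelihood at the MLE gives the relative profile likelihoods

$$R_p(\mu)=\left[\dfrac{\sum_{j=1}^n(\log(x_j)-\hat\mu)^2}{\sum_{j=1}^n(\log(x_j)-\mu)^2}\right]^{n/2};\,\,\,R_p(\sigma)=\left(\dfrac{\hat\sigma}{\sigma}\right)^n\exp\left\{\dfrac{n}{2}\left(1-\dfrac{\hat\sigma^2}{\sigma^2}\right)\right\}.$$

Both equal $1$ at the MLE and decrease away from it, and a likelihood interval of level $c$ is the set of parameter values with $R_p\geq c$. These are the expressions evaluated by the functions `p.mu` and `p.sigma` in the R code of this answer.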

Now, you can use the well-known result that a likelihood interval of level $0.147$ has an approximate confidence of $95\%$. The following R code shows how to calculate these intervals for $\mu$ and $\sigma$ numerically and how to plot the profile likelihoods for your sample.

# Your data
dat = c(0.6695,0.5968, 0.7641, 0.7252, 0.7779)
n = length(dat)

# Profile likelihood of mu
p.mu = function(mu){
muh = mean(log(dat))
return(  (sum((log(dat)-muh)^2)/sum((log(dat)-mu)^2))^(0.5*n)  )
}

# Plot of the profile
vec = seq(-0.75,0,0.01)
rmvec = sapply(vec,p.mu)  # sapply returns a numeric vector (lapply returns a list)

plot(vec,rmvec,type="l")

p.muint = function(mu) return(p.mu(mu)-0.147)

# Approximate 95% confidence interval of mu
c(uniroot(p.muint,c(-0.6,-0.4))$root,uniroot(p.muint,c(-0.3,-0.1))$root)


# Profile likelihood of sigma
p.sigma = function(sigma){
muh = mean(log(dat))
sigmah = sqrt(mean((log(dat)-muh)^2))
return(  (sigmah/sigma)^n*exp(0.5*n)*exp(-0.5*n*sigmah^2/sigma^2) )
}


# Plot of the profile
vec1 = seq(0.01,0.3,0.001)
rsvec = sapply(vec1,p.sigma)  # sapply returns a numeric vector (lapply returns a list)

plot(vec1,rsvec,type="l")

p.sigmaint = function(sigma) return(p.sigma(sigma)-0.147)

# Approximate 95% confidence interval of sigma
c(uniroot(p.sigmaint,c(0.05,0.1))$root,uniroot(p.sigmaint,c(0.15,0.3))$root)
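As a rough cross-check of the interval for $\mu$, here is a minimal sketch that assumes $\log(x_j)$ is exactly normal. It recovers the level $0.147$ from the $\chi^2_1$ quantile, solves the profile likelihood equation for $\mu$ in closed form, and compares the result with the usual t-interval. All variable names (`cutoff`, `profile_int`, `coverage`, `t_int`) are illustrative.

```r
# Sample taken from the answer's code
dat <- c(0.6695, 0.5968, 0.7641, 0.7252, 0.7779)
y <- log(dat)
n <- length(y)

# Likelihood level that corresponds to an asymptotic 95% confidence level
cutoff <- exp(-qchisq(0.95, df = 1) / 2)

# MLEs of mu and sigma on the log scale
mu_hat <- mean(y)
sigma_hat <- sqrt(mean((y - mu_hat)^2))

# Closed-form solution of (profile likelihood of mu) = cutoff
half_width <- sigma_hat * sqrt(cutoff^(-2 / n) - 1)
profile_int <- mu_hat + c(-1, 1) * half_width

# Coverage of that interval when log(x) is normal:
# profile >= cutoff  <=>  T^2 <= (n - 1) * (cutoff^(-2/n) - 1),
# where T is the usual t statistic with n - 1 degrees of freedom
t_bound <- sqrt((n - 1) * (cutoff^(-2 / n) - 1))
coverage <- 2 * pt(t_bound, df = n - 1) - 1

# Exact 95% t-interval for mu, for comparison
t_int <- mu_hat + c(-1, 1) * qt(0.975, df = n - 1) * sd(y) / sqrt(n)

print(round(cutoff, 3))
print(round(profile_int, 3))
print(round(coverage, 3))
print(round(t_int, 3))
```

For $n=5$ the likelihood interval is narrower than the t-interval, and its coverage under this assumption is close to $90\%$ rather than $95\%$, so the "approximate $95\%$" label should be read with some caution for samples this small.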

I hope this helps.