Solved – Using chisq.test in R to measure goodness of fit of a fitted distribution

chi-squared-testgoodness of fitpoisson distributionr

I have a frequency table of claims made by policyholders:

Claims       0,     1,    2,    3,   4
Frequency  2962,  382,   47,   25,   4

And I am going to carry out a chi-squared goodness of fit test to see if it conforms to a Poisson distribution (there are probably far better methods – but I'm teaching basic stats – so go with the flow please).

I converted the frequency table into a vector as follows:

n<-c(0,1,2,3,4)
x<-c(2962,382,47,25,4)
data <- rep(n,x)

I then fitted a Poisson(mu) as follows:

library(MASS)
fitted <-fitdistr(data,"Poisson")
mu<-fitted$estimate

I obtained the expected probabilities under this model using:

exptd<-c(dpois(0,mu),dpois(1,mu),dpois(2,mu),dpois(3,mu),1-ppois(3,mu))

Using round(sum(x)*exptd,3) I noticed that the expected frequency of the last two groups were both <5 and so I combined the last three groups together:

x2<-c(x[1],x[2],sum(x[3:5]))
exptd2<-c(exptd[1],exptd[2],sum(exptd[3:5]))

I now carried out my chi-squared goodness of fit test:

chisq.test(x2,p=exptd2)

I have two issues:

  1. The chi-square test has the wrong degrees of freedom as I estimated a parameter on the data.
  2. This seems dreadfully long-winded – there must be a quicker way of doing this test.

Unfortunately despite searching I can't find examples of this online (which makes me think that this method isn't used despite what our textbooks say) so I need your help.

Many thanks in advance for the help.

Best Answer

User Defined Function can be used for such requirements

I tried to make a naive function for mentioned purpose. you can expand this.

chi2test=function(data,model,dens,df=length(unique(data))-length(params)-1){
  if(!require(pacman)) {install.packages('pacman')}
  pacman::p_load(MASS)
  params=fitdistr(data,model)$estimate
  exptd=dens(unique(data),params)
  exptd[length(exptd)]=1-sum(exptd[-length(exptd)])
  normalise=sum((table(data)-length(data)*exptd)^2/(length(data)*exptd))
  cat('pvalue','=',pchisq(normalise,df,lower.tail = F),';','df','=',df)
  if(any(length(data)*exptd<5)) {print("Warning: there are some expected frequencies that are less than 5")}
}

data should be a vector class.

model is the model defined under 'densfunc' under 'fitdistr' under package MASS... should not be write as object, but under quotation marks.

dens is the density function from default stats package.

df is set default, but you can also add by df=n. where n is positive Integer.

Data

data
   0    1    2    3    4 
2962  382   47   25    4 

data2
   0    1    2 
2962  382   76 

Now, trials

chi2test(data,'Poisson',dpois)

Ouput:- pvalue = 6.845853e-91 ; df = 3[1] "Warning: there are some expected frequencies that are less than 5"

 chi2test(data2,'Poisson',dpois)

Output:- pvalue = 5.72538e-13 ; df = 1

chi2test(data2,'Poisson',dpois,df=2)

Output:- pvalue = 5.267489e-12 ; df = 2

chi2test(data,'geometric',dgeom)

Output:- pvalue = 2.281883e-09 ; df = 3[1] "Warning: there are some expected frequencies that are less than 5"

chi2test(data2,'geometric',dgeom)

Output:- pvalue = 0.05292066 ; df = 1

chi2test(data2,'geometric',dgeom,df=2)

Output:- pvalue = 0.1536295 ; df = 2

Related Question