I'm sorry to be negative, but I think there are a number of reasons that what you're doing won't work very well.
- As a rule of thumb (see e.g. Harrell 2001 Regression Modeling Strategies), you need at least 10 data points per parameter estimated to expect a reliable answer: you have 30 data points for 5 principal component slope parameters, and that's not even counting the random effects variances (which are generally even harder to estimate than fixed effect parameters).
- It doesn't work very well to estimate random-effects variances from fewer than about 5 levels, so a random effect of year (where you've only measured 2 values) is probably not a good idea.
- I don't think lme (on which glmmPQL is built) can handle crossed random effects, so without seeing your call I suspect you're actually fitting Year nested within Individual.
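For what it's worth, lme4 can fit crossed random effects directly. A minimal sketch on simulated data (the variable names Animal, Year, and Distance are my assumptions about your setup):

```r
## crossed vs. nested random effects in lme4, on made-up data
library(lme4)
set.seed(1)
dat <- expand.grid(rep = 1:3, Animal = factor(1:15), Year = factor(2000:2001))
dat$Distance <- exp(rnorm(nrow(dat)))

## crossed: independent random intercepts for Animal and Year
m_crossed <- lmer(log(Distance) ~ 1 + (1 | Animal) + (1 | Year), data = dat)

## nested (roughly what lme/glmmPQL would give you): Year within Animal
m_nested <- lmer(log(Distance) ~ 1 + (1 | Animal/Year), data = dat)
```

With only two Year levels you should expect a singular-fit message here, which is exactly the too-few-levels problem described above.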
Let's look at the data:
source("heidel.dat") ## get data
library(ggplot2)
theme_set(theme_bw()) ## cosmetic
ggplot(dat,aes(x=PC1,y=Distance))+geom_point(aes(colour=Year))+
geom_line(aes(group=Animal),alpha=0.3)+scale_y_log10()
There doesn't seem to be much going on here, but if I did want to try, I would probably forgo the Gamma model and just look at the logged data:
library(lme4)
lmer(log(Distance)~PC1+Year+(1|Animal),data=dat)
At the risk of snooping, let's look at the marginal effects of all 5 PC variables to see if any of them looks like it's doing anything:
library(reshape2)
mdat <- melt(dat, id.vars=c("Distance","Animal","Year"))
library(grid)
zmargin <- theme(panel.spacing=unit(0,"lines")) ## squash facet panels together (panel.margin in older ggplot2)
ggplot(mdat,aes(x=value,y=Distance))+geom_point(aes(colour=Year))+
geom_line(aes(group=Animal),alpha=0.3)+scale_y_log10()+
facet_grid(.~variable,scales="free_x")+geom_smooth(method="lm")+zmargin
Using my magic crystal ball to see the output of str(DF):
...
Aha! SAFE_DRVR_PLEDGE_FLG has another level, "", and in that same row one of the other variables is missing. So that row isn't used in the fit, but when predict tries to construct the model matrix for the predictions, it fails: the level "" exists in the data set but not in the model.
You should probably fix your data set before fitting the model, but you can also get the predictions to work by not telling predict to use the DF data frame; it will then succeed, because it uses the model frame stored in the fitted model. However, this gives fewer predictions than there are rows in DF, because the rows that were causing trouble were automatically discarded.
Example, based on @smillig's:
d <- data.frame(w=rnorm(100),
                x=rnorm(100),
                y=factor(sample(LETTERS[1:2], 100, replace=TRUE)), ## factor response for the binomial fit
                z=sample(LETTERS[3:4], 100, replace=TRUE))
d2 <- rbind(d, data.frame(w=1, x=1, y=NA, z=""))
fm2 <- glm(y ~ w + x + z, data=d2, family=binomial)
predict.glm(fm2, d2, type="response", se.fit=FALSE)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor 'z' has new level(s)
predict(fm2, type="response", se.fit=FALSE)
# No error; this works. However, it has length 100 even though d2 has 101 rows.
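For completeness, here are two sketches of the "fix the data first" route, on a fresh copy of the simulated example (everything here is illustrative, not your actual DF): either recode the empty string as a real NA before fitting, or fit with na.action=na.exclude so the predictions come back NA-padded to the full row count.

```r
set.seed(1)
## rebuild the example data directly (y as a factor so the binomial fit works)
d2 <- data.frame(w = c(rnorm(100), 1),
                 x = c(rnorm(100), 1),
                 y = factor(c(sample(LETTERS[1:2], 100, replace = TRUE), NA)),
                 z = c(sample(LETTERS[3:4], 100, replace = TRUE), ""))

## Option 1: recode "" as a genuine missing value before fitting
d3 <- d2
d3$z[d3$z == ""] <- NA
fm3 <- glm(y ~ w + x + z, data = d3, family = binomial)
p3 <- predict(fm3, d3, type = "response")  ## length 101, NA for the bad row

## Option 2: keep d2, but make the fit remember which rows it dropped
fm4 <- glm(y ~ w + x + z, data = d2, family = binomial,
           na.action = na.exclude)
p4 <- predict(fm4, type = "response")      ## also length 101, NA-padded
```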
Best Answer
This often happens when some intermediate result is invalid, e.g. the logarithm of a non-positive number. In my experience, fitdistr is quite "fragile": sometimes you will still get results, other times (often with Gammas) it just fails. You are almost always better off writing a small function that returns the negative log-likelihood and passing it directly to optim, using Nelder-Mead (to dispense with calculating derivatives, although deriv3 often works).
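As a sketch of what that looks like for a Gamma (simulated data; the log-scale parameterization is my choice, and it is what keeps the optimizer out of the invalid parameter region that trips fitdistr up):

```r
set.seed(101)
x <- rgamma(500, shape = 2, rate = 0.5)  ## made-up data

## negative log-likelihood, with shape and rate on the log scale so
## Nelder-Mead can never propose a negative (invalid) parameter value
negll <- function(p, x) {
  -sum(dgamma(x, shape = exp(p[1]), rate = exp(p[2]), log = TRUE))
}

fit <- optim(c(0, 0), negll, x = x, method = "Nelder-Mead")
exp(fit$par)  ## back-transformed (shape, rate) estimates
```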