Solved – Fitting Weibull distribution in R

Tags: distributions, probability, r, weibull-distribution

My sample data:

dat <- structure(list(cum.per.plant = c(0.051, 0.263, 0.66, 0.807, 0.91, 
0.981, 1, 0.012, 0.07, 0.256, 0.47, 0.731, 0.9, 0.989, 1, 0.022, 
0.203, 0.472, 0.777, 0.95, 0.991, 1, 0.005, 0.03, 0.222, 0.46, 
0.773, 0.97, 0.989, 1, 0.06, 0.28, 0.77, 0.92, 1, 0.03, 0.14, 
0.46, 0.85, 0.99, 1, 0.06, 0.27, 0.63, 0.95, 1, 0.04, 0.14, 0.36, 
0.78, 0.98, 1, 0.05, 0.17, 0.35, 0.67, 0.86, 0.98, 1, 0.07, 0.28, 
0.62, 0.9, 1, 0.05, 0.22, 0.51, 0.81, 0.99, 1, 0.09, 0.2, 0.46, 
0.83, 1, 0.08, 0.26, 0.66, 0.93, 0.99, 1, 0.02, 0.12, 0.31, 0.61, 
0.95, 1, 0.05, 0.21, 0.49, 0.81, 0.92, 0.98, 1, 0.01, 0.1, 0.31, 
0.68, 0.93, 1, 0.02, 0.14, 0.52, 0.8, 0.93, 1, 0.01, 0.15, 0.41, 
0.74, 0.91, 1, 0.11, 0.31, 0.7, 0.85, 0.95, 1, 0.02, 0.1, 0.56, 
0.88, 0.99, 1, 0.06, 0.19, 0.59, 0.91, 1, 0.01, 0.12, 0.39, 0.7, 
0.96, 1, 0.09, 0.28, 0.67, 0.89, 1, 0.12, 0.3, 0.67, 0.88, 1, 
0.01, 0.2, 0.62, 0.88, 0.98, 1, 0.04, 0.23, 0.56, 0.83, 0.99, 
1, 0.01, 0.16, 0.55, 0.83, 1, 0.02, 0.22, 0.63, 0.91, 1, 0.017, 
0.143, 0.38, 0.837, 0.956, 1, 0.02, 0.086, 0.204, 0.672, 0.933, 
1, 0.008, 0.091, 0.506, 0.86, 0.972, 1, 0.018, 0.174, 0.503, 
0.778, 0.974, 1, 0.01, 0.19, 0.57, 0.78, 0.88, 0.95, 1, 0.06, 
0.28, 0.65, 0.88, 1, 0.03, 0.17, 0.53, 0.82, 1, 0.01, 0.09, 0.34, 
0.71, 0.9, 1, 0.1, 0.43, 0.79, 0.98, 1, 0.03, 0.22, 0.63, 0.87, 
1, 0.07, 0.29, 0.69, 0.92, 1, 0.03, 0.26, 0.62, 0.89, 1, 0.09, 
0.2, 0.37, 0.71, 1, 0.07, 0.2, 0.59, 0.84, 0.96, 1, 0.06, 0.18, 
0.63, 0.86, 0.94, 1, 0.08, 0.27, 0.61, 0.88, 1, 0.02, 0.18, 0.39, 
0.64, 0.94, 1, 0.07, 0.23, 0.47, 0.78, 1, 0.03, 0.2, 0.46, 0.79, 
1, 0.07, 0.17, 0.31, 0.59, 0.71, 0.93, 1), loc.id = c(7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
12L, 12L, 12L, 12L, 12L, 12L, 12L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L), year.id = c(4L, 4L, 4L, 4L, 4L, 4L, 
4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 
3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 
4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 
3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 
4L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 
1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 
3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), time.id = c(2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 
6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 8L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
3L, 4L, 5L, 6L, 7L, 3L, 4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 
3L, 4L, 5L, 6L, 7L, 8L, 2L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 2L, 3L, 4L, 5L, 6L, 7L, 3L, 4L, 5L, 6L, 7L, 8L, 3L, 
4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 
8L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L, 3L, 4L, 5L, 6L, 
7L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L, 
6L, 7L, 2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L, 5L, 
6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 
4L, 5L, 6L, 7L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 4L, 5L, 6L, 7L, 
8L, 4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 8L, 5L, 6L, 7L, 8L, 
9L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 
6L, 7L, 8L, 9L, 10L, 4L, 5L, 6L, 7L, 8L, 9L, 4L, 5L, 6L, 7L, 
8L, 9L, 5L, 6L, 7L, 8L, 9L, 5L, 6L, 7L, 8L, 9L, 10L, 5L, 6L, 
7L, 8L, 9L, 5L, 6L, 7L, 8L, 9L, 4L, 5L, 6L, 7L, 8L, 9L, 10L)), .Names = c("cum.per.plant", 
"loc.id", "year.id", "time.id"), class = "data.frame", row.names = c(NA, 
-279L))

The data have four columns:

cum.per.plant: cumulative proportion of the area planted with a crop (hence it goes from 0 to 1)

loc.id: locations where data were collected

year.id: years when the data were collected

time.id: id of the weeks when data were collected.

Hence, for loc.id 7 and year.id 4, planting begins in week 2 and reaches 100% in week 8.
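
For example, that series can be inspected with a base-R one-liner:

subset(dat, loc.id == 7 & year.id == 4)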

I want to implement the approach described in the paragraph below, taken from this paper (in case you want to read it):
https://www.dropbox.com/s/v36i8npfwbutiro/Yang%20et%20al.%202017.pdf?dl=0

Preliminary analysis of the planting data indicates that once planting
is initiated, the cumulative proportion of fields planted for a crop
in a year at the county level follows a sigmoid pattern, but this can
be modified by planting delays due to the soil being too wet; we thus
fitted the observed data to the following modified Weibull
distribution function:

$$\text{ProportionFields} = 1 - \exp\left[-\left(\frac{\text{DOY} - \text{DOY}_{\text{planting.initiation}} - \text{Days.no.plant}}{a}\right)^{b}\right]$$

where ProportionFields is the cumulative proportion of fields that
have been planted in a county, DOY is a calendar day of year
(DOY >= DOYplanting.initiation), DOYplanting.initiation is the calendar
day of year of the earliest planting, and Days.no.plant is the total
number of days on which planting did not occur, due to the soil being
too wet, since the start of planting. $a$ is a scale parameter and $b$
is a shape parameter. DOY − DOYplanting.initiation − Days.no.plant
represents the total number of days on which planting did occur since
the start of planting.

I want to use the above approach, so I planned to do this:

1) Fit a distribution to the data. In the above example they fitted a Weibull distribution, so I fitted the same:

library(fitdistrplus)
fw <- fitdist(dat$cum.per.plant, "weibull")
summary(fw) # shape: 1.2254029, scale: 0.6022573

My first question is:
1) Will the scale and shape parameters be affected by the time step? I.e., if the data had been collected at the daily level, would my shape and scale parameters be divided by some factor?

Now, after I get the parameters, I want to implement this equation to calculate the proportion planted each day for a given year and location:

prop.planted <- 1 - exp(-(x / scale)^shape)

where x = Day of year – Day of year when planting started – No. of days with no planting since the start of planting

I have the data needed to calculate $x$.
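
To check my understanding, here is a minimal sketch of that calculation with purely hypothetical inputs (every value below is made up for illustration; a and b would come from the fit):

a <- 10; b <- 1.5                     #hypothetical scale (a) and shape (b)
doy <- 130                            #hypothetical calendar day of year
doy.init <- 110                       #hypothetical earliest planting day
days.no.plant <- 6                    #hypothetical wet days with no planting
x <- doy - doy.init - days.no.plant   #effective planting days
prop.planted <- 1 - exp(-(x / a)^b)
#equivalently, via the built-in Weibull cdf:
all.equal(prop.planted, pweibull(x, shape = b, scale = a))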

Are the equation and my understanding of the above paper correct?

EDIT:

Suppose the data follow a beta distribution (and not a Weibull distribution). How can I incorporate the factor $x$, calculated as above, when using the beta distribution?

Best Answer

Here's the fitted pdf and cdf (Weibull) for each of locations 1 to 3:

[Figure: fitted CDF and PDF of the Weibull distribution for locations 1 to 3]

Let's break down what we need to do here, keeping in mind that the end goal is to estimate the cumulative proportion of area planted with a certain crop at some value for the random variable time $X$:

  1. The first step is to fit a distribution (e.g. $Weibull\left(a:= \text{shape},b := \text{scale}\right)$ or $Beta\left(\alpha,\beta\right)$).

    • You made an error in fitting the data to a Weibull distribution: the function fitdist (which uses MLE by default) expects a vector of raw observations, which here would be 'the time $X_i$ at which area unit $i$ was planted with the crop', not the cumulative proportions themselves.

      Here is some R that turns the cumulative data into a vector of observations, which can then be used to fit the distribution with fitdist:

      dat <- dat[order(dat$loc.id, dat$year.id), ]
      dat$loc.id <- factor(dat$loc.id)
      dat$year.id <- factor(dat$year.id)
      
      library(plyr)
      #recovering the per-week densities from the cumulative data
      to.density <- function(df) {  #given one location x year block, returns densities
        cdf <- df[, 'cum.per.plant']
        c(cdf[1], diff(cdf))  #first value stands alone; the rest are successive differences
      }
      
      #apply to each location x year in dat, then cbind the density data to dat
      densities <- dlply(dat, .(loc.id, year.id), to.density) 
      dat <- cbind(dat, dens = unlist(densities)) 
      
      #forming a dataset of the variable planting time
      #--because the proportions go to the thousandths place, a pseudo-sample of time.id's with
      #--n=1000 should allow usage of fitdist
      rep.vec <- function(df) {
        rep(df$time.id, round(df$dens * 1000))  #round() avoids silently truncating counts
      }
      
      #apply to each location in dat - may want to include year.id in the split (excluded)
      time.sample <- dlply(dat, .(loc.id), rep.vec) 
      
    • Be sure to assess the 'fit' of your estimated parameters (perhaps by calculating the mean square error) on the data.
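
      One option, instead of a hand-rolled mean square error, is fitdistrplus's gofstat(), which reports Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling statistics for a fitted object. A minimal sketch on one location's pseudo-sample (time.sample is built above):

      library(fitdistrplus)
      gof <- gofstat(fitdist(time.sample[[1]], 'weibull'))
      gof$ks   #Kolmogorov-Smirnov statistic
      gof$ad   #Anderson-Darling statistic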

    • Your code does not indicate that each location and year will be fitted with its own distribution, although you suggest you want to calculate the cumulative area planted for a given location or year.

      Here is some R for fitting each location:

      #fitting separate Weibull distributions for each loc.id (may want to include year.id in 
      #the split)
      library(fitdistrplus)
      fit.weibull <- function(loc) {
        fitdist(loc, 'weibull')$estimate  #named vector: shape, scale
      }
      params <- lapply(time.sample, fit.weibull) #apply to each element in time.sample
      
    • Finally, consider the inclusion of a location parameter, which shifts the graph of the pdf in a negative or positive direction along the x-axis; this should be appropriate because in many locations no area gets planted until the $X=2^{nd}$ week. The inclusion of a location parameter would likely improve the fit of your distributions; one possible sketch follows.
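
      One simple way to approximate this (an assumption on my part, not something from the paper) is to shift the pseudo-sample by a crude location estimate just below the earliest observation, fit the usual two-parameter Weibull to the shifted times, and apply the same shift at prediction time. A minimal sketch:

      x <- time.sample[[2]]   #pseudo-sample of planting weeks for loc.id 2
      shift <- min(x) - 0.5   #crude location estimate; keeps the shifted data positive
      fw.shift <- fitdist(x - shift, 'weibull')
      summary(fw.shift)
      #predictions must apply the same shift: P(X <= t) = pweibull(t - shift, shape, scale)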

  2. Use the fitted cdf (with the parameters informed by the previous step) to predict the cumulative proportion of area planted on a certain day for a given location.

    • It is important to understand what you are doing when you rescale the random variable time $X$ to represent days rather than weeks: it simply amounts to dividing the data (a vector of time observations) by 7, or equivalently multiplying the scale parameter by 7. This can be seen in the equation for the Weibull cdf itself.

      I've demonstrated it through a few lines of R:

      #creating predictive model - on a daily rather than weekly domain
      predict1.cum.plant <- function(day, loc.id, params) {
        #'scaling' 7x the scale parameter
        pweibull(day, shape=params[[loc.id]][[1]], scale=params[[loc.id]][[2]]*7)  
      }
      
      predict2.cum.plant <- function(day, loc.id, params) {
        #'scaling' 1/7x the random variable time
        pweibull(day/7, shape=params[[loc.id]][[1]], scale=params[[loc.id]][[2]])  
      }
      
      #example - the models are equal, which is obvious from the cdf equation!
      predict1.cum.plant(35, 2, params)
      predict2.cum.plant(35, 2, params) 
      pweibull(5, shape=params[[2]][[1]], scale=params[[2]][[2]]) #by week
      

Some supplemental code of mine can be found here.

Very similar methods can be used to fit a Beta distribution.
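
For the beta case, the support must be $(0,1)$, so the weekly pseudo-sample has to be rescaled first. A minimal sketch, assuming we rescale by a horizon just past the last observed week:

library(fitdistrplus)
x <- time.sample[[2]]        #pseudo-sample of planting weeks for loc.id 2
x01 <- x / (max(x) + 1)      #maps the weeks strictly inside (0, 1), as dbeta requires
fb <- fitdist(x01, 'beta')   #MLE of shape1 and shape2
#cumulative proportion planted by week 5, on the same rescaled axis:
pbeta(5 / (max(x) + 1), fb$estimate['shape1'], fb$estimate['shape2'])

The same idea extends to your $x$: compute the effective planting days first, then map them onto $(0,1)$ before evaluating pbeta.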
