Solved – Different quantiles of a fitted GPD in different R packages

extreme valuemaximum likelihoodquantilesr

I am performing an extreme value analysis for meteorological data, to be precise for precipitation data available in mm/d. I am using a threshold excess approach for estimating the parameters of a generalized Pareto distribution with a maximum likelihood method.

The aim is to calculate several return levels (i.e. the 2, 5, 10, 20, 50, 100 year event) for daily precipitation.

While the R code works fine, I am wondering why I get clearly different results when calculating the quantiles (and thus the return levels) of the fitted GPD with different packages. Even though the estimated parameters of the GPD are almost identical in each package, the quantiles differ a lot.

The packages I used are:
ismev, extRemes, evir and POT.

I guess that the different estimates for the parameters of the GPD are due to different calculation routines, but I do not understand why the calculation of the quantiles differs that much depending on the different packages.

# packages
library(ismev)
library(extRemes)
library(evir)
library(POT)
library(lmom)

th <- 50

# sample data:
potvalues <- c(
  58.5,44.2,49.6,59.3,48.3,60.9,94.5,47.1,45.3,57.6,48.2,46.2,44.2,50.6,42.1,52.7,80.9,
  58.5,51.3,48.4,51.7,71.9,60.1,64.4,43.5,55.5,49.3,58.2,47.5,43.7,45.2,52.8,42.2,46.4,
  96.1,47.5,50.1,42.4,60.9,72.6,51.6,59.4,80.5,63.7,59.9,45.0,66.7,47.6,53.3,43.1,51.0,
  46.2,53.6,59.8,51.7,46.7,42.6,44.5,45.0,50.0,44.0,89.9,44.2,47.8,53.3,43.0,55.7,44.6,
  44.6,54.9,45.1,43.9,78.7,45.5,64.0,42.7,47.4,57.0,105.4,64.3,43.2,50.4,80.2,49.9,71.6,
  47.4,44.1,47.6,55.2,44.4,78.6,50.8,42.4,47.1,43.5,51.4)

#------------------------------------------------------------------------------------------#

# MLE Fitting of GPD - package extRemes

# fit gpd
pot.ext <- fevd(potvalues, method = "MLE", type="GP", threshold=th)

# return levels:
rl.extremes <-  return.level(pot.ext, conf = 0.05,
                             return.period= c(2,5,10,20,50,100))
rl.extremes <- as.numeric(rl.extremes)

#------------------------------------------------------------------------------------------#

# MLE Fitting of GPD - package ismev

pot.gpd <- gpd.fit(potvalues, threshold=th)

s1 <- quagpa(f=.99, para=c(pot.gpd$threshold, pot.gpd$mle)) # 100
s2 <- quagpa(f=.98, para=c(pot.gpd$threshold, pot.gpd$mle)) #  50
s3 <- quagpa(f=.95, para=c(pot.gpd$threshold, pot.gpd$mle)) #  20
s4 <- quagpa(f=.90, para=c(pot.gpd$threshold, pot.gpd$mle)) #  10
s5 <- quagpa(f=.80, para=c(pot.gpd$threshold, pot.gpd$mle)) #   5
s6 <- quagpa(f=.50, para=c(pot.gpd$threshold, pot.gpd$mle)) #   2

rl.ismev <- c(s6, s5, s4, s3, s2, s1)

#------------------------------------------------------------------------------------------#

# MLE Fitting of GPD - package evir

# fit gpd
gpd.evir <- gpd(potvalues, threshold=th)

# plot
evirplot <- plot(gpd.evir)
1 # Excess Distribution
0 # exit

x100 <- gpd.q(pp=.99, x=evirplot) # 100
x050 <- gpd.q(pp=.98, x=evirplot) #  50
x020 <- gpd.q(pp=.95, x=evirplot) #  20
x010 <- gpd.q(pp=.90, x=evirplot) #  10
x005 <- gpd.q(pp=.80, x=evirplot) #   5
x002 <- gpd.q(pp=.50, x=evirplot) #   2

rl.evir <- t(round(rbind(x002,x005,x010,x020,x050,x100),1))
rl.evir <- as.numeric(rl.evir[2,])

#------------------------------------------------------------------------------------------#

# MLE Fitting of GPD - package POT

gpd.pot <- fitgpd(potvalues, threshold=th)
quant = c(0.50, 0.80, 0.90, 0.95, 0.98, 0.99)
rtp <- c(2,5,10,20,50,100)

retvec <- vector()
for (i in quant){
  x <- POT::qgpd(i, loc = th, scale = as.numeric(gpd.pot$param[1]),
                shape = as.numeric(gpd.pot$param[2]))
  retvec <- c(retvec,x)
}

rl.pot <- retvec

#------------------------------------------------------------------------------------------#
# comparison of results - return periods
(result <- cbind(rl.extremes,rl.ismev, rl.evir, rl.pot))

#------------------------------------------------------------------------------------------#
# comparison of estimated parameters
param.extremes <- pot.ext$results$par # extremes
param.ismev <- pot.gpd$mle # ismev
    param.evir <- c(gpd.evir$par.ests[2],gpd.evir$par.ests[1])  # evir
    param.pot <- gpd.pot$param # POT

parameters <- cbind(param.extremes, param.ismev , param.evir, param.pot)
round(parameters, 4)

#------------------------------------------------------------------------------------------#

Best Answer

You could try also evd or Renext packages.

As you noted, all packages functions returned quite similar parameters, so all fitted models embed the same quantile function and the problems come from differing time scales.

For POT models, I believe useful to think of an event rate, say $\lambda$, which is estimated as well as the model for the levels $X_i$ or for excesses $Y_i = X_i-u$, where $u$ is the threshold. The estimated rate $\lambda$ is the number of levels $X_i$ with $X_i>u$ per unit of duration. If you want to use years to compute return levels (RL), you should consider that $\lambda$ expresses in event by year (or inverse year). The return level for period $T$ is then given by $$ x(T) = q_X(1 - 1/\lambda/T) = u + q_Y(1 - 1/\lambda/T) $$ where $q_X$, $q_Y$ are the quantile functions. This formula can be used for $T > 1/\lambda$.

The rate $\lambda$ can be estimated by $\widehat{\lambda} = n_Y /w$ where $w$ is the duration of the fit period and $n_Y$ is the number of exceedances, here $48$. Curiously enough, most R packages do not require $w$ or even allow the user to give it. They rather use a convention which must be well understood by carefully reading the documentation and the examples. For extremes::fevd, the default values of the formals will correspond to $w = 96/365$ years, so $\lambda =182.5$ evt/year in your example. With ismev::gpd.fit you can specify $w$ using a value $n_X / w$ for the npy formal. Similarly evd::fpot has a npp argument for that.

Note that the parameters for the GP excesses do not depend on (``are orthogonal to'') $\lambda$, and their estimation is not impacted by the assumption on $\lambda$ which will however be crucial for RLs. Yet fitted model have an estimation of $\lambda$ used to compute the RLs.

Except for extremes, you computed the return levels yourself, not relying on a prediction function and assuming that $\lambda =1$. This may not be appropriate, because it does not seem that the duration of your fitting period is $48$ years.

## none of this fits uses the good duration?
library(Renext)
## matches (nearly) the computed quantiles labelled 'ismev'
fRen1 <- Renouv(x = potvalues, threshold = 50, effDuration = 96, 
                distname.y = "gpd")
predict(fRen1, newdata = c(2, 5, 10, 20, 50, 100))

##  matches (nearly) the computed quantiles labelled 'evir' and 'pot'
fRen2 <- Renouv(x = potvalues, threshold = 50, effDuration = 48, 
                distname.y = "gpd")
predict(fRen2, newdata = c(2, 5, 10, 20, 50, 100))

## matches (nearly) the computed quantiles labelled 'extremes'
fRen3 <- Renouv(x = potvalues, threshold = 50, effDuration = 96/365,
                distname.y = "gpd")
predict(fRen3, newdata = c(2, 5, 10, 20, 50, 100))

Related Solutions

Solved – Fitting custom distributions by MLE

This answer assumes $\mu$ is known.

One very flexible way to get MLE's in R is to use STAN via rstan. STAN has a reputation for being an MCMC tool, but it also can estimate parameters by variational inference or MAP. And you're free to not specify the priors.

In this case, what you're doing is very similar to their hurdle-model example. Here is the STAN code for that example.

data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> lambda;
}
model {
  for (n in 1:N) {
    if (y[n] == 0)
      target += bernoulli_lpmf(1 | theta);
    else
      target += bernoulli_lpmf(0 | theta)
                  + poisson_lpmf(y[n] | lambda);
  }
}

To adapt this for your own use, you could:

Replace poisson_lpmf with the log-density for your $f_A$.
Add a third case to the if-else so that it checks for exceeding $\mu$, not just 0. As the meat of that third case, use the log pmf for your extreme value distribution of choice.
Replace bernoulli_lpmf with categorical_lpmf and make the mixture probability parameter into a vector.
To incorporate covariates, you can add regression parameters, and make all your other parameters functions of them. It may help to use categorical_logit_lpmf in place of categorical_lpmf.
Truncate one mixture component at $\mu$ from above and the other at $\mu$ from below, depending on your perspective on the dilemma raised by Jarle Tufto in the comments. It seems like you could get VERY different estimates depending on how exactly you decide to handle this. A nice sanity check: generate a fake dataset from the fitted parameters and make sure it has the right amount at 0, amount above $\mu$, etc.

Once you have a file with the right STAN code, you can use STAN with lots of different toolchains. To use it with R, check out these examples. I simplified one to get an MLE, using rstan::optimizing instead of sampling:

install.packages("rstan")
library("rstan")
model = stan_model("Example1.stan")
fit = optimizing(model)

There are also some tricks for faster/better optimization that could help in practice.

Best Answer

Related Solutions

Solved – Fitting custom distributions by MLE

Related Question