I’m working on simulating many sequences streamflow from a “known” (synthetic) model and fitting it to a GEV distribution. Generally the model converges quite well, but sometimes it converges to a result where mu
is very large and sigma
very small (using notation from Wikipedia. When it does this, unsurprisingly, the model also takes a very long time to fit.
As a hack I can reduce this problem by putting more informative priors on my parameters (i.e. based on the posterior distribution from the sequences that fit well) but this is neither intellectually defensible nor very robust. I’d like to reparameterize the GEV to better address this but don’t have a good idea for a good reparameterization. If anyone has any resources they would like to point me towards, it would be much appreciated.
/*
Fit a GEV distribution in stan
Code is original but heavily based on previous work at:
http://bechtel.colorado.edu/~bracken/tutorials/stan/
http://discourse.mc-stan.org/t/troubles-in-rejecting-initial-value/1827
*/
functions{
real gev_lpdf(vector y, real mu, real sigma, real xi) {
vector[rows(y)] z;
vector[rows(y)] lp;
int N;
N = rows(y);
for(n in 1:N){
z[n] = 1 + (y[n] - mu) * xi / sigma;
lp[n] = (1 + (1 / xi)) * log(z[n]) + pow(z[n], -1/xi);
}
return -sum(lp) - N * log(sigma);
}
}
data {
int<lower=0> N;
vector[N] y;
}
transformed data {
real min_y;
real max_y;
real sd_y;
min_y = min(y);
max_y = max(y);
sd_y = sd(y);
}
parameters {
real<lower=-0.5,upper=0.5> xi;
real<lower=0> sigma;
// location has upper/lower bounds depending on the value of xi
real<lower=if_else( xi > 0, min_y, negative_infinity()),
upper=if_else( xi > 0, positive_infinity(), max_y )> mu;
}
model {
// Priors -- these can be modified depending on your context
real xi_plus_half;
xi_plus_half = xi + 0.5;
xi_plus_half ~ beta(3, 12); // somewhat informative: expectation -0.3 per Martins2000
sigma ~ normal(0, 100); //
mu ~ normal(0, 250); //
// Data Model
y ~ gev(mu, sigma, xi);
}
To better visualize the problems I am seeing, consider these two images. To generate them, I created 100 synthetic data sets and then fit each in stan
using the above model and 1000 posterior draws. The first shows the posterior distribution of mu
and sigma
(for all 100 * 1000 draws). For the second, I took the posterior mean of each parameter for each sequence (independently – this is just for illustration) and plotted the resulting generalized extreme value CDF function (scipy.stats.genextreme
) for each of the 100 sequences. Apologies for the difficulty seeing, but there are two CDFs which are almost horizontal at around y=3000 and y=200 – corresponding to the dots in the first plot.
An example of the data I would fit is (in python). The data is actually floats but this is not all printing to screen.
array([ 1447., 1939., 1100., 720., 1316., 349., 528., 781., 359.,
508., 1338., 379., 524., 353., 485., 279., 295., 541.,
345., 484., 490., 230., 265., 242., 300., 1381., 1493.,
768., 406., 364., 484., 473., 330., 328., 355., 388.,
814., 617., 453., 465., 258., 896., 873., 1028., 841.,
916., 358., 652., 753., 1095., 301., 302., 273., 251.,
561., 999., 1410., 751., 802., 662., 1324., 923., 669.,
1046., 1311., 391., 396., 915., 457., 264., 626., 407.,
411., 724., 441., 503., 428., 359., 539., 405., 727.,
522., 265., 390., 392., 546., 319., 301., 364., 299.,
555., 714., 1306., 397., 458., 464., 406., 445., 405.,
318.])
N.B.
: the above results were actually obtained using the informative (but not justifiable) priors discussed in the original post, rather than the priors shown in the stan
code.