Different estimation of parameters when sampling from generalized pareto in loop

I’m getting different estimates of the parameters k and sigma for some of the individuals when running the model on a loop. (For clarity, I will refer to the individuals within the groups as individuals. There are 22 individuals in the study.)

The individual (non-looped) estimates are the correct parameter estimates in that they fit the data much better.

I’ll start with the Stan code, which is very similar to Aki’s case study on the generalized pareto found here except I’ve pared it down and extended it to multiple groups.

functions {
  real gpareto_lpdf(vector y, real ymin, real k, real sigma, real s) {
    // generalised Pareto log pdf 
    real N = s;
    real inv_k = inv(k);
    //if (k<0 && max(y-ymin)/sigma > -inv_k)
    //  reject("k<0 and max(y-ymin)/sigma > -1/k; found k, sigma =", k, sigma)
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma)
    if (fabs(k) > 1e-15)
      return -(1+inv_k)*sum(log1p((y-ymin) * (k/sigma))) -N*log(sigma);
    else
      return -sum(y-ymin)/sigma -N*log(sigma); // limit k->0
      }
}
data {
   // number of observations
  int<lower=1> N;
  // number of groups
  int<lower=-1> K;
  //observations
  vector<lower=0>[N] y;
  // group sizes 
  int<lower=0> s[K];
  // ymax
  vector<lower=0>[K] ymax;
//threshold
  vector<lower=0>[K] ymin;
}

parameters {
vector<lower=0>[K] sigma; 
vector<lower=-sigma[K]/(ymax[K]-ymin[K])>[K] k; 
}
model {
	int pos;
	pos = 1;
  
  for(i in 1:K){
  segment(y, pos, s[i]) ~ gpareto(ymin[i], k[i], sigma[i], s[i]);
  pos = pos + s[i];
  }
}

In the model block you can see that the model runs in a for loop for K different groups (not to be confused with the parameter k).

For most of the individuals, I get the same estimates for the parameters regardless of whether I’m running the model in a loop or for the individual only.

However, for individual 4, I get a very different estimate. Without the loop, I get parameter k = -1.5 and sigma = 6.4. However, if I run the model in the for loop, I get k=.08 and sigma = 2.75. Something is wrong with the loop model.

Rplot

My best guess as to why this is happening is that for individual 4, the estimate of the parameter k is negative. (When k < 0, the Generalized Pareto becomes a Weilbull distribution instead of a Frechet.)

What I would like is for the model to provide the better estimates (allowing for negative values for k) when running on a loop.

Strangely, if I run the loop for the first four individuals only, I get the correct parameter estimates for individual 4. It is only when I add more individuals to the model that these estimates change.

Also, when I include individuals 20, 21, and 22 to the model, the model samples radically diverge. I suspect this happens for the same reason because when I model only these individuals without the loop, their parameter estimates for k are also negative.

data
ccl.csv (740.9 KB)
customer-thresholds.csv (393 Bytes)

correct estimates for individuals 4, 20, 21 and 22
indv_params_means_extra.csv (135 Bytes)

r script for stan model- note you can filter by customers in the dplyr pipeline
02-individual-effects-model.R (2.0 KB)

stan script (with loop- same script as in-line)
individual-effects.stan (1.2 KB)

stan script (original)
original.stan (2.5 KB)

Looks like the problem is here. This selects the last sigma, ymax, ymin, and then sets the lower bound for every k based on those. Everything except the last group will have the wrong lower bound.
The next release of CmdStan will allow array-bounds and you’ll be able to do

vector<lower=-sigma ./ (ymax-ymin)>[K] k; 

but until then this workaround is needed.

parameters {
  vector<lower=0>[K] k_offset;
}
transformed parameters {
  vector[K] k = -sigma ./ (ymax-ymin) + k_offset;
} 
2 Likes