Hierarchical model with probability parameter


I have a hierarchical model with a beta prior on a vector of subject-level parameters, with hyperparameters w1 and w2.
I am running 4 chains with 15000 iterations (14000 for warmup), and Rhat is still ~1.3 for w2.
What is the recommended prior for a probability parameter xi, bounded between 0 and 1, to avoid sampling problems in a hierarchical model?

Best regards, Benjamin

data {
  int<lower=0> N_SUBJECTS;
  int<lower=0> N_TRIALS;
  int<lower=0> choice[N_TRIALS, N_SUBJECTS];
}
parameters {
  vector<lower=0, upper=1>[N_SUBJECTS] xi;
  real<lower=0> w1;
  real<lower=0> w2;
}
model {
  w1 ~ cauchy(0, 10);
  w2 ~ cauchy(0, 10);
  xi ~ beta(1/w1 + 1, 1/w2 + 1);
  // k and m come from the full model, omitted here
  for (t in 1:N_TRIALS)
    for (s in 1:N_SUBJECTS)
      choice[t, s] ~ bernoulli(Phi_approx(k[t, s] - m[s]) * (1 - xi[s])
                               + xi[s] / 2);
}

Hm, I just took a quick look. So you’re modeling the choice probability as a function of k. What is m[s], and do you want a scalar parameter in front of k[t, s]? Can you explain your choice of the probability parameter that goes into the Bernoulli, Phi_approx(k[t, s] - m[s]) * (1 - xi[s]) + xi[s] / 2? Looking at the manual, the way Phi_approx is used is y[n] ~ bernoulli(Phi_approx(alpha + beta * x[n])); so I’m just wondering what the xi multiplication and addition terms are doing.

Also, just wondering: why are you parameterizing the beta distribution that way instead of with the usual alpha and beta? That could affect the posterior geometry and make it harder to sample.
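For reference, a common alternative is the mean/concentration parameterization of the beta, which often has friendlier geometry in hierarchical models. A minimal sketch of the conversion (the names mu and kappa are my own labels, not from the model above):

```python
def beta_from_mean_concentration(mu, kappa):
    """Convert a mean mu in (0, 1) and concentration kappa > 0
    to the standard beta shape parameters (alpha, beta)."""
    alpha = mu * kappa
    beta = (1.0 - mu) * kappa
    return alpha, beta

alpha, beta = beta_from_mean_concentration(0.2, 10.0)
# The implied beta distribution has mean alpha / (alpha + beta) = mu.
print(alpha, beta, alpha / (alpha + beta))
```

You would then put priors on mu and kappa directly and write xi ~ beta(mu * kappa, (1 - mu) * kappa) in the model.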

Hi Arya,
I did not write out the full model, to keep things clear, but the problem seems to be the xi parameter: the chains mix well when I remove it.
xi is a ‘lapse’ parameter, quite standard in models of choice, accounting for noise in the choice probability (the part not explained by the predictors).
I don’t think the manual implies that Phi_approx can only be used that way; it is just an example. I use it to squash the Bernoulli parameter between 0 and 1 (I could use inv_logit as well). Multiplying by (1 - xi) and adding xi/2 keeps the parameter within [0, 1] as long as xi is also in that range.
I initially tried beta(w1 + 1, w2 + 1), but that did not work either. Any other thoughts?

It’s hard to say, but another possibility is that the Cauchy priors on w1 and w2 might be hard to sample; see Betancourt’s recent post on that. Maybe try a Gaussian with a small standard deviation and go from there. As for the beta, I’d stick with the usual way it’s parameterized in Stan.
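To see why the Cauchy tails can cause trouble, compare upper quantiles of a half-Cauchy(0, 10) prior with a half-normal(0, 1). A stdlib-only numeric illustration (my own sketch, not code from the thread):

```python
import math
from statistics import NormalDist

def half_cauchy_ppf(q, scale):
    """Quantile of a half-Cauchy(0, scale): CDF is (2/pi) * atan(x/scale)."""
    return scale * math.tan(math.pi * q / 2.0)

def half_normal_ppf(q, scale):
    """Quantile of a half-normal(0, scale)."""
    return scale * NormalDist().inv_cdf(0.5 + q / 2.0)

# 99th percentiles: the half-Cauchy(0, 10) tail reaches into the
# hundreds, while the half-normal(0, 1) stays below 3, so the
# sampler has far less extreme territory to explore.
print(half_cauchy_ppf(0.99, 10.0))  # ~636.6
print(half_normal_ppf(0.99, 1.0))   # ~2.58
```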

Another thing you can try if your chains aren’t mixing well is increasing the maximum tree depth, but usually I’ve found the problem is in my model rather than NUTS not going far enough.

If you’re getting divergences, definitely check what the parameter values are when they occur, using ShinyStan. That often gives me good clues about what might be going wrong.


Yes, you can use Phi_approx this way; it’s just an inverse logit rescaled to roughly match the probit scale. Inverse logit is a bit more efficient as a result.
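For reference, Phi_approx(x) is the logistic approximation inv_logit(0.07056 * x^3 + 1.5976 * x) to the standard normal CDF. A quick check of its accuracy against the exact CDF (a sketch in Python, not Stan code):

```python
import math
from statistics import NormalDist

def phi_approx(x):
    """Logistic approximation to the standard normal CDF,
    matching the form used by Stan's Phi_approx."""
    return 1.0 / (1.0 + math.exp(-(0.07056 * x**3 + 1.5976 * x)))

# Worst-case error over a grid on [-4, 4] is small.
err = max(abs(phi_approx(x / 10.0) - NormalDist().cdf(x / 10.0))
          for x in range(-40, 41))
print(err)  # well under 0.002
```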

You can look to see whether probability mass is piling up near the boundaries you put on the parameters. If that happens, you’ll need to reduce the step size by increasing adapt_delta in the control parameters (this raises the target acceptance rate, which lowers the step size).

The thing to look at in the posterior is the pairs plot: are you sure the model is still identified with the new parameter? If not, you may need a stronger prior to keep things reasonable.