# Gamma Regression in Stan VERY Different from Frequentist Results from lme4

I have a simple gamma regression model.

We are modeling score as a function of group membership. I’ll add more covariates later, once this basic model is stable. Plotting the density of the score shows nine clear peaks.

If we summarize the score within each group, each group mean almost exactly matches the corresponding density peak.

Using lme4 in R with the following formula:

m <-  glmer(score ~ 1 + (1|group), data=df, family=Gamma(link="log"))

(Intercept) 4.88003

$group
  (Intercept)
1 -0.82205851
2 -0.62208084
3 -0.45549832
4 -0.31192307
5 -0.26049639
6 -0.17649159
7 -0.06063024
8  0.05163697

This looks very good compared to the actual data. Generating random Gamma values with these coefficients and plotting them on top of the original density plot gives a VERY close fit.

However, when I attempt to replicate the same model in Stan, to better capture uncertainty, the posterior samples are vastly different. What’s really odd is that regardless of the priors I choose, the posterior is VERY, VERY far away. I used values from the Frequentist/lme4 model as informative priors, but Stan seems to just ignore them.

data {
int<lower=0> N;
real<lower=0> score[N];
int<lower=0> N_group;
int<lower=0> group_f[N];
}
parameters {
real alpha0; // baseline intercept
vector[N_group] alpha_group;
vector[N_group] phi;
}
model {
alpha0 ~ normal(log(80), log(0.5));
alpha_group ~ normal(0, 0.5);
phi ~ normal(0, 1);
for (i in 1:N) {
real mu_t = exp(alpha0 + alpha_group[group_f[i]]);
real alpha = square(mu_t) / phi[group_f[i]];
real beta = mu_t / phi[group_f[i]];
score[i] ~ gamma(alpha, beta);
}
}

I would have expected Stan’s posterior samples for the coefficients to be somewhat near the Frequentist estimates. Am I:

1. Doing something wrong with my logic or model concepts?
2. Implementing this incorrectly in Stan?

Happy to share some data if anybody wants to try. (Not sure how to attach the CSV here.) Any help or suggestions would be greatly appreciated!

vector[N_group] phi;
...
phi ~ normal(0, 1);
...
real beta = mu_t / phi[group_f[i]];

Looks like there’s nothing stopping \phi \leq 0, which would break the gamma pdf. Are you getting any divergent-transition warnings? If so, that might indicate that the sampler is trying to get into the regions of phi that don’t work. Maybe try a lower bound on phi, which would fix this issue if it is causing the problem:

vector<lower=0>[N_group] phi;

and possibly a boundary-avoiding prior.

Thanks Scott!
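An aside not from the thread: the shape/rate choice in the model above, alpha = mu_t^2 / phi and beta = mu_t / phi, implies a Gamma distribution with mean mu_t and variance phi. A quick Monte Carlo sketch in Python with made-up values (the standard library's `gammavariate` takes shape and scale, i.e. 1/rate):

```python
import random

random.seed(7)

mu, phi = 80.0, 50.0   # hypothetical values, just for the check
shape = mu * mu / phi  # alpha in the Stan model
rate = mu / phi        # beta in the Stan model

# gammavariate takes (shape, scale) with scale = 1 / rate
draws = [random.gammavariate(shape, 1.0 / rate) for _ in range(200_000)]
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / len(draws)
print(m, v)  # mean ≈ mu = 80, variance ≈ phi = 50
```

This also shows why phi must be positive: it plays the role of a variance here.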
I’ll try that right now and report back!

1 Like

OK, I have a clue as to what’s happening, but not sure how to remedy it. Looking at the counts by group, I see that Group 2 has a significantly larger number of members:

> table(df$group)

1     2     3     4     5     6     7     8     9
7454 22266 12419  6978 11425  6186  1297   934   219


For my model, I want the main intercept (alpha0) to have the largest value (population mean), and then the group intercepts to represent deviations from that. This works perfectly in the lme4 model.

Despite what I set for priors, Stan has moved that large value to the group intercept for group 2.

i.e., in Stan, samples are all around these values:
alpha0 = -0.563625
group_f[1] = -0.227232
group_f[2] = 4.82206

My guess is that there are so many members in that group, that the log posterior is highest when that intercept is the largest, but this is not what I want.

Maybe I need a different structure of priors to force Stan to put the largest coefficient value on the main intercept. Any ideas as to how?

In general: there are a lot of factors that will affect an individual’s score. I’m starting by modelling the random effects, to get all that handled nicely in Stan. Then, I’ll add the other covariates on the individual level. So, I really need the main intercept, alpha0, to represent the population mean. Otherwise the sampler will pull things all over the place as I add covariates.
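One way to see why the likelihood allows this (my own sketch, with made-up numbers): shifting a constant from the intercept into every group offset leaves the likelihood exactly unchanged, so the data alone cannot pin down how the total is split between alpha0 and alpha_group. Only the priors break the tie, and with this much data they barely bite. A hand-rolled Gamma log-likelihood makes the invariance explicit:

```python
import math

# Hand-rolled Gamma log-pdf with shape/rate, matching the model's
# alpha = mu^2 / phi, beta = mu / phi parameterization.
def gamma_logpdf(x, shape, rate):
    return shape * math.log(rate) - math.lgamma(shape) + (shape - 1.0) * math.log(x) - rate * x

def loglik(alpha0, alpha_group, phi, scores, groups):
    total = 0.0
    for x, g in zip(scores, groups):
        mu = math.exp(alpha0 + alpha_group[g])
        total += gamma_logpdf(x, mu * mu / phi, mu / phi)
    return total

# Made-up data and coefficients, purely for illustration.
scores = [60.0, 75.0, 90.0]
groups = [0, 1, 0]
a0, ag, phi = 4.4, [-0.2, 0.1], 25.0

# Shift a constant from the intercept into every group offset:
c = 3.0
ag_shifted = [v - c for v in ag]

print(loglik(a0, ag, phi, scores, groups))
print(loglik(a0 + c, ag_shifted, phi, scores, groups))  # same value: the split is not identified
```

This is why hierarchical models usually put a normal(0, sigma_group) prior on the offsets with an estimated sigma_group: it shrinks them toward zero and pushes the common level into the intercept.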

Perhaps specifying your group-level intercept as a vector of deviations from the overall intercept would be helpful?

Like:

alpha0 ~ normal(log(80), log(0.5));
alpha_group ~ normal(alpha0, 0.5);


But, then I couldn’t add alpha0 + alpha_group[i] in the linear combination. Are you suggesting to just use alpha_group?

There will also be other intercept “modifiers” later. So it might get messy this way.

I just noticed something:

alpha0 ~ normal(log(80), log(0.5));


log(0.5) is negative, so you’re using a negative standard deviation in your prior. I’m not even sure how you got that to run.

1 Like
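For what it’s worth (my own check, not output from the thread): a negative scale is rejected by most statistics libraries, and Stan’s normal() likewise rejects non-positive scales at runtime. In Python, for example:

```python
import math
from statistics import NormalDist, StatisticsError

sd = math.log(0.5)  # ≈ -0.693: negative, so not a valid standard deviation
print(f"log(0.5) = {sd:.4f}")

try:
    NormalDist(mu=math.log(80), sigma=sd)
except StatisticsError as err:
    print("rejected:", err)
```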

I’m in no way an expert in Stan programming, nor in gamma regression, so I can’t promise this is viable, but: an approach could be to vectorise those assignments in the model rather than using a for loop, by making two vectors of length N in the transformed parameters block, where each case gets assigned its group-level alpha and phi. That might also make it easier to incorporate later adjustments, by making them all an N-length vector of individual adjustments to the intercept.

For example:

data {
int<lower=0> N;
real<lower=0> score[N];
int<lower=0> N_group;
int<lower=0> group_f[N];
}
parameters {
real alpha0; // baseline intercept
vector[N_group] alpha_group;
vector[N_group] phi;
}
transformed parameters{
vector[N] alpha_N = alpha_group[group_f];
vector[N] phi_N = phi[group_f];
}
model {
vector[N] mu_t = exp(alpha0  + alpha_N);
vector[N] alpha = square(mu_t) / phi_N;
vector[N] beta = mu_t / phi_N;
alpha0 ~ normal(log(80), log(0.5));
alpha_group ~ normal(alpha0,0.5);
phi ~ normal(0,1);
score ~ gamma(alpha, beta);
}


You’re correct. That was a typo in the model I put here. The one I ran uses

alpha0 ~ normal(log(80), 0.5);


Thanks for pointing that out.

1 Like

From a quick glance I would say that you are comparing different models. In the Stan model you posted, you let the dispersion parameter \phi vary by group. AFAIK this is not something lme4 can do.

Also, you might want to try non-centered parameterization for \alpha_{\text{group}}.

If you want what lme4::glmer is doing, you are probably better off just using rstanarm::stan_glmer. Is there any reason why you want to write this model yourself?

PS: you might also want to check out this recent thread.

1 Like

Max,

You make a good point. The Stan model is different in that the variance (dispersion) varies by group.

Two motivations for not using rstanarm:

1. I want to allow for different variance per group
2. I want much tighter control over the priors, based on domain knowledge

I haven’t learned to do those with rstanarm, and I’m not sure it’s possible.

Fair enough. Another advantage of writing the Stan model from scratch (at least for me) is that it makes you think more about the model and is generally very educative.

Looking at your Stan model again, I couldn’t actually find anything hierarchical in it. There is no standard deviation estimated for the groups. Here’s how I would code a simple hierarchical Gamma regression (without varying dispersion for now). This is also a more general tip: Start simple and make sure everything works and build up from that.

Note that I coded it using non-centered parameterization. Disclaimer: I didn’t actually try that code, as I was only writing it here on the forum. You might have to check for typos etc., but I hope this will give you a more general idea. You can build in the varying dispersion parameter from there, but judging from my own experience, I guess this can be hard to fit.

data {
int<lower=0> N;
real<lower=0> score[N];
int<lower=0> N_group;
int<lower=0> group_f[N];
}

parameters {
real alpha0; // baseline intercept
real<lower=0> sigma_group;
vector[N_group] alpha_group_tilde;
real<lower=0> inverse_phi;
}

transformed parameters{
vector[N_group] alpha_group = sigma_group * alpha_group_tilde;
}

model {
alpha0 ~ normal(log(80), 0.5); // prior for grand mean

sigma_group ~ exponential(1); // prior for group standard deviation
alpha_group_tilde ~ std_normal(); // non-centered parameterization

inverse_phi ~ exponential(1); // prior on inverse dispersion parameter

// likelihood
score ~ gamma(inverse_phi, rep_vector(inverse_phi, N) ./ exp(alpha0 + alpha_group[group_f]));

}


I hope that this is clear. Please ask, if you have more questions. :)
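A quick numeric check of the rate parameterization above (my sketch, with hypothetical values): gamma(a, a/mu) has mean mu for any shape a, which is exactly why the likelihood line targets exp(alpha0 + alpha_group) as the mean:

```python
import random

random.seed(11)

mu = 80.0          # target mean, exp(alpha0 + alpha_group) in the model
inverse_phi = 2.5  # hypothetical shape value

# gammavariate takes (shape, scale); rate inverse_phi / mu means scale mu / inverse_phi
draws = [random.gammavariate(inverse_phi, mu / inverse_phi) for _ in range(200_000)]
m = sum(draws) / len(draws)
print(m)  # ≈ 80, regardless of the value of inverse_phi
```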

1 Like

Thanks!

I’ve been playing with a very similar parameterization. It WORKS, but there is still something wrong.

The first approximately 400 samples take 20-30 minutes. Tailing the samples.csv file, the posterior draws for the parameters strongly match my intuition about what they should be. I even took a few, plotted them in R against the data, and the fits look great.

THEN, the model seems to get stuck. While the first 400 samples took 30 minutes, the next 50 samples took 8 HOURS.

What’s confusing to me is that the model starts sampling quickly, the values look great, and then it gets stuck. My guess is that Stan has somehow gotten stuck in some corner of sampling space.

I’ve tried adjusting the parameters for the priors, but it doesn’t seem to help much.

Here is the latest Stan model:

data {
int<lower=0> N;
real<lower=0> score[N];
int<lower=0> N_group;
int<lower=0> group_f[N];
}

parameters {

// baseline intercept
real alpha0;

// Random effects
vector[N_group]  group_eta;
real<lower=0>   group_scale;

vector[N_group]  phi_eta;
real<lower=0>   phi_scale;
}

transformed parameters{
vector[N_group] group_re;
vector[N_group] phi_group;

group_re = group_scale * group_eta;
phi_group = phi_scale * phi_eta;

}

model {

vector[N] mu;
vector[N] phi;

group_eta    ~ normal(0,1);
group_scale  ~ exponential(.1);

phi_eta    ~ normal(0,1);
phi_scale  ~ exponential(.1);

alpha0     ~ normal(5, 10);

mu  = exp(alpha0 + group_re[group_f]);
phi = 1 ./ exp(phi_group[group_f]);

target += gamma_lpdf(score | phi, phi ./ mu);

}
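For reference (my sketch, not from the thread): the non-centered parameterization in the transformed parameters block relies on scale * eta, with eta ~ normal(0, 1), having the same distribution as normal(0, scale). A quick simulation check:

```python
import random

random.seed(3)

# eta ~ normal(0, 1); scale * eta should be distributed as normal(0, scale)
scale = 0.7
draws = [scale * random.gauss(0.0, 1.0) for _ in range(200_000)]

m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / len(draws)
print(m, v)  # mean ≈ 0, variance ≈ scale**2 = 0.49
```

The point of the trick is that the sampler explores the unit-scale eta, which avoids the funnel geometry between a group effect and its scale parameter.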



Did you get any divergence warnings? If possible, can you try doing pairs() on your samples and posting the plot here?

I get a few divergence warnings during the first few samples of warmup (maybe in samples 1-10), then nothing.

It is still running, so not sure how to do any pairs() plots. In R, read_stan_csv() returns an error, as the file doesn’t contain as many samples as the header comments indicate.

Thanks!

Completely crawled to a halt after about 680 samples.

Is there any way I can see how/where/why Stan is stuck?

I would parametrize it as inverse_phi = exp(phi0 + phi_group[group_f]) with \phi_0 as the overall log inverse dispersion.

Other than that the model seems fine. If Stan is “getting stuck” (i.e. is slow) it could be that the priors are far away from where the posterior mass is (and it’s hard for Stan to get there). High correlation could also be problematic, as Scott pointed out.

If I see it correctly, you have around 70,000 observations. Stan will be slow on problems like this, even with an efficiently coded model. It may be that in your case the centered parameterization is superior, because you have a lot of data… hard to tell. Could you post some of the results that you get, including diagnostics (see the robust Bayesian workflow case study by Michael Betancourt)? Divergences can be an indication of a wrong parameterization of the varying intercepts (and dispersions). Max treedepth warnings can indicate high correlations in the posterior (slow to sample).

With a dataset this big you might want to consider alternatives to Stan - although you’ll most probably be less flexible in modeling and less robust in the results you get when you don’t use Stan. I haven’t tried it yet, but I’ve read that INLA is pretty fast with large datasets. AFAIK its approximate Bayesian inference is okay if you have a relatively simple varying-intercepts (random effects) regression model. But other people on this forum know a lot more about the pros and cons of INLA vs. Stan.

PS: while you are in the “model building phase” you can also work with an appropriate/representative subset of your data to speed things up. From my experience, 5,000-10,000 observations are much faster to fit (fewer will of course be even faster). And maybe just use 500 iterations (300 warmup, 200 sampling, with 3-4 chains) to see what kinds of warnings you get (see above) and adapt your model accordingly. Then, with your final model, fit the whole dataset using many iterations (long warmup, many chains).
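The subsetting advice can be sketched as a stratified subsample so the group proportions of the full data are preserved (my sketch; the data here is a made-up stand-in for the real CSV):

```python
import random

random.seed(0)

# Made-up stand-in for the real data: (score, group) pairs, ~70,000 rows
data = [(random.gammavariate(2.0, 40.0), random.randint(1, 9)) for _ in range(70_000)]

frac = 0.10  # keep roughly 10% of each group
by_group = {}
for score, group in data:
    by_group.setdefault(group, []).append((score, group))

subset = []
for group, rows in by_group.items():
    k = max(1, round(len(rows) * frac))
    subset.extend(random.sample(rows, k))

print(len(subset))  # roughly 10% of the rows, with the same group mix as the full data
```

Sampling within each group (rather than uniformly over rows) matters here because the smallest groups (e.g. group 9 with 219 rows) would otherwise risk being nearly empty in the subset.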