I am trying to run a GLM in Stan for Gamma-distributed data, but the model is not converging.
The data are simulated (so I know they follow a Gamma distribution), and in fact by fitting the model as a GLM with statsmodels I can recover the right estimates for the model parameters.
I’ve already looked at this post but couldn’t find any solution.
import pystan

# Put our data in a dictionary
data = {'N': X.shape[0], 'K': X.shape[1], 'X': X, 'y': y}
# Compile the model
stanm = pystan.StanModel(file='./stan/gamma_model.stan')
# Train the model and generate samples
stan_fit = stanm.sampling(data=data, iter=1000, chains=4, warmup=500, thin=1, seed=101)
COMPILING THE C++ CODE FOR MODEL anon_model_a1534e6d883628ba3a5700d4305b8abc NOW.
n_eff / iter below 0.001 indicates that the effective sample size has likely been overestimated
Rhat above 1.1 or below 0.9 indicates that the chains very likely have not mixed
5 of 2000 iterations ended with a divergence (0.25 %).
Try running with adapt_delta larger than 0.8 to remove the divergences.
32 of 2000 iterations saturated the maximum tree depth of 10 (1.6 %)
Run again with max_treedepth larger than 10 to avoid saturation
Chain 1: E-BFMI = 0.0119
Chain 2: E-BFMI = 0.00533
Chain 3: E-BFMI = 0.000414
Chain 4: E-BFMI = 0.0019
E-BFMI below 0.2 indicates you may need to reparameterize your model
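For reference, in PyStan 2 both of those suggestions are passed through the control argument of sampling; a minimal sketch, with illustrative (not tuned) values:

# Re-run with a higher acceptance target and a deeper NUTS tree.
stan_fit = stanm.sampling(
    data=data, iter=1000, chains=4, warmup=500, seed=101,
    control={'adapt_delta': 0.95,    # default is 0.8
             'max_treedepth': 12})   # default is 10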
import statsmodels.api as sm

sm_fit = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.log())).fit()
print(sm_fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 177
Model: GLM Df Residuals: 172
Model Family: Gamma Df Model: 4
Link Function: log Scale: 0.052743
Method: IRLS Log-Likelihood: -2443.8
Date: Mon, 29 Jun 2020 Deviance: 7.6958
Time: 09:08:28 Pearson chi2: 9.07
No. Iterations: 11
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.1812 0.005 33.899 0.000 0.171 0.192
x2 0.0485 0.005 9.632 0.000 0.039 0.058
x3 -0.0600 0.006 -9.760 0.000 -0.072 -0.048
x4 -0.0218 0.008 -2.753 0.006 -0.037 -0.006
const 13.7931 0.019 719.882 0.000 13.756 13.831
Stan model:
data {
  int N; //the number of observations
  int K; //the number of columns in the model matrix
  real y[N]; //the response
  matrix[N,K] X; //the model matrix
}
parameters {
  vector[K] betas; //the regression parameters
  real phi; //the variance parameter
}
transformed parameters {
  vector[N] mu; //the expected values (inverse link of the linear predictor)
  vector[N] alpha; //shape parameter for the gamma distribution
  vector[N] beta; //rate parameter for the gamma distribution
  mu = exp(X * betas); //using the log link
  alpha = mu .* mu / phi;
  beta = mu / phi;
}
model {
  betas[1] ~ cauchy(0, 10); //prior for the intercept following Gelman 2008
  for (i in 2:K)
    betas[i] ~ cauchy(0, 2.5); //prior for the slopes following Gelman 2008
  y ~ gamma(alpha, beta);
}
generated quantities {
  vector[N] y_rep;
  for (n in 1:N) {
    y_rep[n] = gamma_rng(alpha[n], beta[n]); //posterior draws for posterior predictive checks
  }
}
Hey! The first thing I'd fix is that phi needs to be declared with a lower bound of zero, i.e. real<lower=0> phi;, and a prior for phi (or inverse_phi if you follow my proposed parameterization).
The priors on the regression coefficients are fairly wide. Something like N(0,5) would probably be fine for the intercept, and I'd actually go for something like N(0,1) for the other coefficients, since they are on the log scale.
Also, it sometimes helps to center and scale your covariates (if you have not already done this). And if the covariates are correlated, a QR decomposition might be helpful; see the sketch just below.
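For instance, a minimal NumPy sketch of both ideas (X_std, Q, and R are illustrative names; this assumes X holds the covariate columns, with the intercept handled separately):

import numpy as np

# Center and scale each covariate column: zero mean, unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Thin QR decomposition: the columns of Q are orthogonal, which
# decorrelates the design; coefficients fitted on the Q scale map
# back to the original scale via R.
Q, R = np.linalg.qr(X_std)  # 'reduced' mode by default
# betas_X = np.linalg.solve(R, betas_Q)  # back-transform after fitting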
Cheers!
Max
Side note: This

for(i in 2:K)
  betas[i] ~ cauchy(0, 2.5);

can be simplified to the one-liner betas[2:] ~ cauchy(0, 2.5); or betas[2:] ~ std_normal(); if you want to try the prior I proposed above.
data {
  int N; //the number of observations
  int K; //the number of columns in the model matrix
  real y[N]; //the response
  matrix[N,K] X; //the model matrix
}
parameters {
  vector[K] betas; //the regression parameters
  real<lower=0> inverse_phi; //the inverse dispersion (shape) parameter
}
transformed parameters {
  vector[N] mu; //the expected values (inverse link of the linear predictor)
  vector[N] beta; //rate parameter for the gamma distribution
  mu = exp(X * betas); //using the log link
  beta = rep_vector(inverse_phi, N) ./ mu;
}
model {
  betas[1] ~ cauchy(0, 10); //prior for the intercept following Gelman 2008
  betas[2:] ~ cauchy(0, 2.5); //prior for the slopes following Gelman 2008
  inverse_phi ~ exponential(1); //prior on inverse dispersion parameter
  y ~ gamma(inverse_phi, beta);
}
generated quantities {
  vector[N] y_rep;
  for (n in 1:N) {
    y_rep[n] = gamma_rng(inverse_phi, beta[n]); //posterior draws for posterior predictive checks
  }
}
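Once this samples, the y_rep draws can be pulled out for a quick posterior predictive check; a hypothetical PyStan 2 sketch (the mean is just one illustrative test statistic):

import numpy as np

# Posterior predictive draws, shape (n_draws, N).
y_rep = stan_fit.extract()['y_rep']

# Compare a test statistic on replicated data with the observed value.
t_rep = y_rep.mean(axis=1)
t_obs = np.mean(y)
ppp = np.mean(t_rep >= t_obs)  # posterior predictive p-value
print('PP p-value for the mean:', ppp)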
Hey! I'm not quite sure I understand correctly (I guess you mean the regression coefficients?), but a Jacobian adjustment for the parameters is not necessary here.
Rule of thumb: if you put your priors on the parameters exactly as they are declared in the parameters block, you are fine (in the example above, betas and inverse_phi are assigned priors directly, so that's fine).
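For completeness, the rule in math form: if $\theta$ is the declared parameter but the prior is stated for a transform $f(\theta)$, the implied density on $\theta$ picks up a Jacobian factor,

$$
p_\theta(\theta) = p_f\big(f(\theta)\big)\,\left|\frac{df}{d\theta}\right|,
$$

so you would have to add $\log\left|\frac{df}{d\theta}\right|$ to target yourself. A prior placed directly on a declared parameter, like betas or inverse_phi above, needs no such adjustment.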
I'm really curious why this reparameterization works when the original parameterization fails so badly. (I had tried the same thing as gcarto and, apparently, others before finding this page.)
First, the original post didn't constrain phi to be > 0, so there are lots of problems just from that. Add real<lower=0> phi. The Cauchy priors probably don't help either, and Gelman doesn't recommend those anymore. Try something like normal(0, 4), which is reasonably diffuse.
The parameterization from @Max_Mantei and the one in the post I linked to above both give zeros for the off-diagonal part of the Fisher information matrix, which helps sampling (a quick check follows below).
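A quick check of that claim under the shape/mean parameterization used above (shape $a$, mean $\mu$, rate $b = a/\mu$): the per-observation log likelihood is

$$
\ell(a,\mu) = a\log a - a\log\mu - \log\Gamma(a) + (a-1)\log y - \frac{a\,y}{\mu},
$$

and the mixed second derivative,

$$
\frac{\partial^2 \ell}{\partial \mu\,\partial a} = -\frac{1}{\mu} + \frac{y}{\mu^2},
$$

has expectation zero because $\mathbb{E}[y] = \mu$. So the $(a,\mu)$ cross term of the Fisher information vanishes: the shape and the mean are orthogonal parameters, which makes the posterior geometry much friendlier for HMC.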
Btw here’s what I get with the OP’s data and this slightly updated model with no additional reparameterization of the gamma distribution:
data {
  int N; //the number of observations
  int K; //the number of columns in the model matrix
  vector[N] y; //the response
  matrix[N,K] X; //the model matrix
}
parameters {
  vector[K] betas; //the regression parameters
  real<lower=0> phi; //the variance parameter
}
transformed parameters {
  vector[N] mu = exp(X * betas); //the expected values (inverse link of the linear predictor)
  vector[N] alpha = mu .* mu / phi; //shape parameter for the gamma distribution
  vector[N] beta = mu / phi; //rate parameter for the gamma distribution
}
model {
  betas ~ normal(0, 4);
  y ~ gamma(alpha, beta);
}
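To reproduce that comparison end to end, a hypothetical PyStan 2 driver (the file name is illustrative):

import pystan

# Compile the updated model and sample with the same data dictionary.
stanm = pystan.StanModel(file='./stan/gamma_model_v2.stan')
stan_fit = stanm.sampling(data=data, iter=1000, chains=4, warmup=500, seed=101)

# Posterior means of the coefficients, to compare against the
# statsmodels IRLS estimates shown earlier in the thread.
print(stan_fit.extract()['betas'].mean(axis=0))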