Parametrisation of generalised gamma distribution amenable to Stan

tillahoffmann · September 29, 2020, 1:41pm

The short version

I would like to fit a generalised gamma distribution to right-skewed data because it encompasses the lognormal, gamma, and Weibull as special cases.

Unfortunately, common parametrisations in terms of shape a, scale b, and exponent c (e.g. see here or the wiki page) leave the parameters highly correlated in the posterior. Stan’s HMC sampler is doing a pretty good job on toy examples, but the correlated posteriors become problematic when the generalised gamma distributions are nested in a hierarchical model. Are you aware of a parametrisation that is more amenable to use with Stan?

The longer version

The standard gamma distribution suffers from a similar problem when it is parametrised in terms of shape a and scale b. Changing the parametrisation to shape a and mean \mu=a/b significantly improves the efficiency of the sampler (see below for an example that can switch between different parametrisations).

data {
    int flag;  // Flag for which parametrisation to use
    int n;  // Number of samples
    vector<lower=0>[n] x;  // Sample values
}

parameters {
    real<lower=0> a;  // Shape of the gamma distribution
    real<lower=0> p;  // Scale or location parameter
}

transformed parameters {
    real<lower=0> b;
    // Parametrise in terms of shape and scale
    if (flag == 0) {
        b = p;
    } 
    // Parametrise in terms of shape and location
    else {
        b = a / p;
    }
}

model {
    x ~ gamma(a, b);
}

Fitting the model to synthetic data with known shape a=3 and scale b=4 yields the following scatter plot of samples from the posterior. The parameter p corresponds to the scale b in parametrisation 1 and to the mean \mu in parametrisation 2.

Parametrisation 2 leaves the parameters uncorrelated in the posterior and is more efficient as indicated by the n_{eff}/n values. This is of course expected because the parametrisation gives rise to a diagonal Fisher information matrix. I’ve experimented with finding a parametrisation \theta such that a=a(\theta) etc. with diagonal Fisher information. But dealing with all the partial derivatives of the log likelihood is rather fiddly, and I was wondering whether you’ve already thought about this in more depth.

bbbales2 · September 30, 2020, 2:12am

Hmm I’m not that familiar with gammas, but parameterizing in terms of a mean seems easier to think about. The Neffs seem really good too. Did you finish the model you were writing this for (if that question makes sense)? How did it work out there?

I assume the \alpha, \beta parameterization comes from conjugate priors for precision on normal distributions. The appeal to conjugate priors is computation, so if this parameterization leads to easier computation for MCMC then that seems like just as-good an argument.

Going by https://cran.r-project.org/web/packages/brms/vignettes/brms_families.html, I think brms uses that parameterization.

tillahoffmann · October 2, 2020, 3:24pm

Nice, thank you for the BRMS pointer.

I didn’t manage to get the generalised gamma to sample in the hierarchical model so instead had a look at the specific limits of the gamma, Weibull, and lognormal. Being able to interpolate between them would be nice, but not essential for my specific problem: estimating the arithmetic population mean of positive values.

In terms of the model, I have some population-level parameters \theta that give rise to parameters \phi_i for a patient i. Those, in turn, give me my observations x_{ij}, where i is the patient index and j is the sample index. The quantity of interest is the distribution over expected values of x for new patients. The gamma and Weibull distributions give pretty similar results, but the lognormal suggests incredibly large means \bar x, presumably because of the heavy tails as discussed by Rubin here (https://doi.org/10.1016/B978-0-12-121160-8.50017-X).

tillahoffmann · October 11, 2020, 9:40pm

A friend pointed me to the parametrisation by Stacey (1962) which is rather neat: if \gamma\sim\mathrm{Gamma}(q^{-2},1), then x=\exp\left(\mu+\sigma/q \log(q^2 \gamma)\right) follows a generalised gamma distribution, and there is an easy mapping between the (q,\mu,\sigma) parametrisatoin and the (a, b, c) parameterization (more here). This parameterization also allows a setup akin to the non-centered parametrisatoin discussed in the section on reparametrization in the documentation by using \gamma rather than x as the latent parameter (although q and \gamma of course remain dependent in the hierarchical prior).

With this setup, I can get the hierarchical model to sample somewhat sensibly but still get a number of divergences. I believe these divergences may be a consequence of \sigma and q being poorly identified (cf. the ratio of \sigma/q in the exponential that defines x and the pair plots below). Any input on how to get this model to sample without divergences would be awesome! I tried redefining x=\exp\left(\mu+\sigma \log(q^2 \gamma)\right) in the hope to make q and \sigma more easily identifiable—but no luck. So here’s the whole story…

Context and model description

I have biomarker measurements y_{ij} of specimen j for patient i. The biomarker concentration can be either a positive number (measured in copies per mL) or a negative test, which means that the concentration was below the level of quantification (LOQ) of the assay. So I model the actual concentration as x_{ij} with a left-censored observation model that gives rise to y_{ij} (the level of quantification is known). The distribution of samples from the patient is informed by parameters q, \mu_i, \sigma at the level of each patient i (the parameters defining the shape of the distribution are pooled amongst all patients). The expected concentration \nu_i for patient i is drawn from a population-level generalised gamma distribution with parameters Q, \lambda, \Sigma, and I determine \mu_i by demanding that \langle y_{ij}\rangle=\nu_i for all j.

In more formal notation,

\begin{align} \nu_i&\sim\mathrm{GeneralisedGamma}(Q,\lambda,\Sigma)\\ \mu_i&=\log\nu_i + \log\Gamma(q^{-2})-\log\Gamma(q^{-2}(1 + q\sigma)) - 2 \sigma\log(q)/q\\ y_{ij}&\sim \mathrm{GeneralisedGamma}(q,\mu_i,\sigma)\\ x_{ij}&=\begin{cases} y_{ij}&\text{if }y_{ij}\geq\theta_{ij}\\ \emptyset&\text{if }y_{ij}<\theta_{ij} \end{cases}, \end{align}

where \theta_{ij} is the level of quantification for the corresponding sample, \emptyset denotes a negative test, and the second line guarantees that \langle y_{ij}\rangle=\nu_i. Here’s the model with sort-of-non-centred parametrisation.

functions {
    // Log pdf for the generalised gamma distribution according to Stacey (1962).
    real gengamma_lpdf(real x, real q, real mu, real sigma) {
        real q2 = q ^ 2;
        return - (sigma * x ^ (q / sigma) * exp(-mu * q / sigma) + (2 - q2) * sigma * log(q) +
            q * (mu - log(x) + q * sigma * (log(sigma * x) + lgamma(1 / q2)))) / (sigma * q2);
    }

    // Log cdf for the generalised gamma distribution according to Stacey (1962).
    real gengamma_lcdf(real x, real q, real mu, real sigma) {
        real log_arg = q / sigma * log(x) - mu * q / sigma - 2 * log(q);
        return log(gamma_p(1 / q ^ 2, exp(log_arg)));
    }
}

data {
    // Information about samples and associations between samples and patients.
    int<lower=0> num_patients;
    int<lower=0> num_samples;
    // Lookup table for patients: i = idx[j] is the patient index for sample j.
    int idx[num_samples];

    // Information about number of positives and negatives for each patient.
    int<lower=0> num_samples_by_patient[num_patients];
    int<lower=0> num_positives_by_patient[num_patients];
    int<lower=0> num_negatives_by_patient[num_patients];

    // Concentration measurements and levels of quantification.
    vector[num_samples] load;
    vector[num_samples] loq;
}

parameters {
    // Population level parameters.
    real population_loc;
    real<lower=0> population_shape;
    real<lower=0> population_scale;

    // Individual level parameters
    real<lower=0> patient_scale;
    real<lower=0> patient_shape;
    // Random variable for non-centred setup.
    vector<lower=0>[num_patients] patient_gamma_;
}

transformed parameters {
    // Evaluate the patient mean in terms of the gamma random variable for an almost
    // non-centred parametrisatoin.
    vector<lower=0>[num_patients] patient_mean = exp(population_loc + population_scale *
        log(population_shape ^ 2 * patient_gamma_) / population_shape);
    // Contribution to the patient_loc, evaluated once for efficiency.
    real loc_contrib_ = lgamma(1 / patient_shape ^ 2) -
        lgamma((1 + patient_shape * patient_scale) / patient_shape ^ 2) -
        2 * patient_scale * log(patient_shape) / patient_shape;
    // Vector to hold the contributions to the target density for each sample.
    vector[num_samples] sample_contrib_;

    // Evaluate contributions to the target.
    for (j in 1:num_samples) {
        // Evaluation the location parameter for this sample.
        real loc = log(patient_mean[idx[j]]) + loc_contrib_;
        if (load[j] > loq[j]) {
            // Account for quantitative observations of the concentration in a sample.
            sample_contrib_[j] = gengamma_lpdf(load[j] | patient_shape, loc, patient_scale);
        } else {
            // Handle left-censoring if the concentration is below the level of quantification.
            sample_contrib_[j] = gengamma_lcdf(loq[j] | patient_shape, loc, patient_scale);
        }
    }
}

model {
    // Sample the latent gamma parameters that induce the generalised gamma prior for the
    // patient mean.
    target += gamma_lpdf(patient_gamma_ | 1 / population_shape ^ 2, 1);
    // Add the contributions from all the samples.
    target += sum(sample_contrib_);
}

On simulated data (gist), the model is able to recover roughly the right input parameters (I haven’t done a proper credible interval cover analysis in light of the divergences). There are still quite a few divergences, especially near the boundaries, and I probably need to put some time into defining sensible priors to tame the divergences. But I was hoping to get the model to sample so I don’t accidentally hide the pathologies by slapping on priors. Anyway, here are the pairs plots with divergences highlighted.

It looks like the curvature variation at the “ends of the population_shape vs population_scale banana” may be the culprit, but I seem to have reached the limits of my reparametrisation skills. Posterior predictive checks seem fine. Any comments would be greatly appreciated!

bbbales2 · October 12, 2020, 12:26pm

Very cool.

Does adapt_delta=0.95 or something help at all? Or is that already turned up?

I think it would be worth trying to run this model with a constant population scale or shape (or both) to get a handle on if either one of these is particularly leading to the divergences (or it’s both).

tillahoffmann · October 13, 2020, 12:30pm

Increasing adapt_delta allows the sampler to go further into the tails of the banana, but it ultimately still diverges (even for adapt_delta = 0.99).

Here’s an expanded model with an indicator flag to choose different setups. Restricting the model in any fashion (flag > 0 below) allows sampling to proceed successfully. In other words, the pathologies only arise when the parameters are inferred jointly. Pair plots are after the model code.

Flag values

0: infer all parameters (desired case)
1: gamma distribution such that q = \sigma
2: shape q pinned to known values
3: scale \sigma pinned to known values

functions {

    // Log pdf for the generalised gamma distribution according to Stacey (1962).
    real gengamma_lpdf(real x, real q, real mu, real sigma) {
        real q2 = q ^ 2;
        return - (sigma * x ^ (q / sigma) * exp(-mu * q / sigma) + (2 - q2) * sigma * log(q) +
            q * (mu - log(x) + q * sigma * (log(sigma * x) + lgamma(1 / q2)))) / (sigma * q2);
    }

    // Log cdf for the generalised gamma distribution according to Stacey (1962).
    real gengamma_lcdf(real x, real q, real mu, real sigma) {
        real log_arg = q / sigma * log(x) - mu * q / sigma - 2 * log(q);
        return log(gamma_p(1 / q ^ 2, exp(log_arg)));
    }

}

data {
    // Information about samples and associations between samples and patients.
    int<lower=0> num_patients;
    int<lower=0> num_samples;
    // Lookup table for patients: i = idx[j] is the patient index for sample j.
    int idx[num_samples];

    // Information about number of positives and negatives for each patient.
    int<lower=0> num_samples_by_patient[num_patients];
    int<lower=0> num_positives_by_patient[num_patients];
    int<lower=0> num_negatives_by_patient[num_patients];

    // Concentration measurements and levels of quantification.
    vector[num_samples] load;
    vector[num_samples] loq;


    // Flag to experiment with parametrisation
    // 0: infer all parameters
    // 1: gamma distribution such that q = sigma
    // 2: shape q pinned to known values
    // 3: scale sigma pinned to known values
    int flag;
    real population_shape_fixed;
    real population_scale_fixed;
    real patient_shape_fixed;
    real patient_scale_fixed;
}

parameters {
    // Population level parameters.
    real population_loc;
    real<lower=0> population_shape_param_;
    real<lower=0> population_scale_param_;

    // Individual level parameters
    real<lower=0> patient_scale_param_;
    real<lower=0> patient_shape_param_;
    // Random variable for non-centred setup.
    vector<lower=0>[num_patients] patient_gamma_;
}

transformed parameters {
    // Declare transformed parameters.
    vector<lower=0>[num_patients] patient_mean;
    real<lower=0> population_shape;
    real<lower=0> population_scale;
    real<lower=0> patient_shape;
    real<lower=0> patient_scale;
    real loc_contrib_;
    // Vector to hold the contributions to the target density for each sample.
    vector[num_samples] sample_contrib_;

    // Evaluate parameters based on the supplied flag
    if (flag == 0) {
        population_shape = population_shape_param_;
        population_scale = population_scale_param_;
        patient_shape = patient_shape_param_;
        patient_scale = patient_scale_param_;
    } else if (flag == 1) {
        population_shape = population_shape_param_;
        population_scale = population_shape_param_;
        patient_shape = patient_shape_param_;
        patient_scale = patient_shape_param_;
    } else if (flag == 2) {
        population_shape = population_shape_fixed;
        population_scale = population_scale_param_;
        patient_shape = patient_shape_fixed;
        patient_scale = patient_scale_param_;
    } else if (flag == 3) {
        population_shape = population_shape_param_;
        population_scale = population_scale_fixed;
        patient_shape = patient_shape_param_;
        patient_scale = patient_scale_fixed;
    } else {
        reject("unknown flag value: ", flag);
    }

    // Evaluate the patient mean in terms of the gamma random variable for an almost non-centred
    // parametrisatoin.
    patient_mean = exp(population_loc + population_scale *
        log(population_shape ^ 2 * patient_gamma_) / population_shape);
    // Contribution to the patient_loc, evaluated once for efficiency.
    loc_contrib_ = lgamma(1 / patient_shape ^ 2) -
        lgamma((1 + patient_shape * patient_scale) / patient_shape ^ 2) -
        2 * patient_scale * log(patient_shape) / patient_shape;

    // Evaluate contributions to the target.
    for (j in 1:num_samples) {
        // Evaluation the location parameter for this sample.
        real loc = log(patient_mean[idx[j]]) + loc_contrib_;
        if (load[j] > loq[j]) {
            // Account for quantitative observations of the concentration in a sample.
            sample_contrib_[j] = gengamma_lpdf(load[j] | patient_shape, loc, patient_scale);
        } else {
            // Handle left-censoring if the concentration is below the level of quantification.
            sample_contrib_[j] = gengamma_lcdf(loq[j] | patient_shape, loc, patient_scale);
        }
    }
}

model {
    // Sample the latent gamma parameters that induce the generalised gamma prior for the
    // patient mean.
    target += gamma_lpdf(patient_gamma_ | 1 / population_shape ^ 2, 1);
    // Add the contributions from all the samples.
    target += sum(sample_contrib_);
    // Add some standard priors for parameters that are left unconstrained based on the flag.
    if (flag == 1 || flag == 3) {
        population_scale_param_ ~ gamma(1, 1);
        patient_scale_param_ ~ gamma(1, 1);
    } else if (flag == 2) {
        population_shape_param_ ~ gamma(1, 1);
        patient_shape_param_ ~ gamma(1, 1);
    }
}

bbbales2 · October 13, 2020, 12:38pm

Oh bummer.

Cool, so this probably is the issue.

Going back to:

Does this sample better if you parameterize it in terms of inverse shape or inverse scale?

tillahoffmann · October 13, 2020, 1:51pm

I was pondering about parametrising in terms on inverse scale but didn’t go for it in the end. I assume it wouldn’t make much of a difference because the lower=0 limit leads to a log transform to get to the unconstrained space which would then just induce a minus sign. Or did I misunderstand that?

bbbales2 · October 13, 2020, 6:42pm

I have a lot of trouble thinking through parameter transforms. I think it’s better to just try them if there’s any question at all. It’s too complicated to try to stack together the parameterizations + constraints + jacobians + etc. etc. in your head.

Those population shape/scale curves just really look like they’re doing some sort of 1/x thing.

simkovic · December 5, 2020, 7:50pm

Hi Till,

it’s very unlikely that there is a global orthogonal parametrization. Though, it’s straightforward to fix a parameter and to determine two parameters such that they are orthogonal to the fixed parameter but not to each other (Cox & Reid, 1987). One such solution: geometric mean is orthogonal to scale and index parameter. Taking scale as fixed has no analytic solution for replacing the index parameter. I’m currently writing a manuscript with these results.

As you already found out in your simulations the mean is orthogonal to the shape parameter (of the gamma distribution). So you may try to look for the second orthogonal parameter to replace the index parameter.

If you wish to try a parametrization with a geometric mean you will find my implementation of inverse digamma function helpful.

Cox, D. R., & Reid, N. (1987). Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society: Series B (Methodological) , 49 (1), 1-18.

Richie_M · March 22, 2023, 3:39pm

@tillahoffmann Thanks for demonstrating the advantage (more efficient) of the uncorrelated [shape, mean] over the [shape, scale] parameterization of the gamma distribution. I got that advice from the Stan forum from another member and have adopted it in my gamma distribution study. Aside from this coding and results you’ve shown here, is there another formal reference like a book or article I can cite in my manuscript to substantiate claim that the Stan HMC sample works better with the [shape, mean] parameterization than the [shape, scale]?
Thanks,
Richie_M

tillahoffmann · March 22, 2023, 5:16pm

I’m not aware of a specific reference for the gamma distribution, but there is an example for the beta distribution in the Stan user guide. The section on theoretical background references Gelman, Andrew. 2004. “Parameterization and Bayesian Modeling.” Journal of the American Statistical Association 99: 537–45. (I don’t think I’ve read it yet).

Richie_M · March 22, 2023, 9:25pm

Thanks @tillahoffmann for the response. I looked in the 2004 Gelman paper cited in the Stan manual but that did not really directly addressed correlated parameters.

I found this paper:

which cited:
Rannala B. Identifiability of parameters in MCMC Bayesian inference of phylogeny. Syst Biol. 2002;51(5):754–760. doi: 10.1080/10635150290102429

I may just also cite this Stan forum URL as your simulation result is gamma distribution specific. For your metric n_eff / n, did you use a large n, close to the magnitude of a n_eff at least 1000 for well-mixed chains)?

tillahoffmann · April 12, 2023, 5:03pm

I have to admit that I don’t quite remember, but I’ll have a look whether I can find the notebook with the experiments I ran; they’re on an older computer that I’ll have to start up. Fit a beta distribution that captures correlation in the fitted parameters? may also be relevant and might have other pointers (although I’ve only skimmed it).

tillahoffmann · June 7, 2023, 3:53pm

@Richie_M I’ve had a dig through my old machine but wasn’t able to find the original code, unfortunately.

Richie_M · June 7, 2023, 5:14pm

@tillahoffmann That’s alright, thanks for digging through.

luchinoprince · March 13, 2024, 3:08pm

I see that parametrizing the gamma with shape and location/mean aids HMC sampling; currently I was parametrizing the Gamma with mean and variance, is such a parametrisation known to be good/bad? Too me it seemed a very logical way to look at it, but maybe I am too biased by the Gaussian distribution.

Best,
Luca

LucC · March 14, 2024, 8:29am

Shape and mean works better for a gamma distribution because for a given shape, the variance scales with the mean, so there is dependency there. Whereas the shape parameter is a sort of measure of the variability relative to the mean.

Topic		Replies	Views
Generalized gamma distributions Developers specification	11	1729	December 7, 2023
Posterior estimates of rate and shape of gamma distribution are dependent Modeling	14	2166	May 17, 2023
Gamma(.., ..)^2 -> generalized_gamma(?, ?, ?) Modeling fitting-issues	23	4487	February 7, 2018
Reparameterisation to remove "banana" from posterior distribution Modeling fitting-issues , specification , performance	7	1424	December 6, 2021
Jeffreys priors for gamma distribution (unknown shape and scale) Modeling	13	2569	February 20, 2023

Parametrisation of generalised gamma distribution amenable to Stan

The short version

The longer version

Context and model description

Related topics