Soft constraining a product of parameters (pushforward distribution) with a prior when there are also priors for individual parameters

Dear Stanimals,

Just as I thought I got the hang of dealing with Jacobians, I’m facing a new interesting challenge where these babies may (not?) turn up:

In my model, I want to include prior information on a function of multiple parameters, in addition to marginal priors for the individual parameters. At the moment, I can’t figure out whether including this prior for the function of parameters also means I need to deal with the Jacobian (and if so, what that would look like).

I am trying to model extremely overdispersed data that has three hierarchical levels: individuals, repeated samples per individual over time, and repeated testing of the same sample. Each of these three levels is associated with considerable variation (coefficients of variation on the order of 0.25 to 4.0), which together create the overall extreme level of overdispersion. My aim is to decompose the variance into the three levels with a suite of candidate models that assume different distributions for the variation at each level (e.g., gamma or log-normal). To be able to use the same priors for each model and to fairly compare the models, I am parameterising the variance structure of each model in terms of the coefficient of variation (CV) associated with each hierarchical level: CV_1, CV_2, and CV_3. Within the model, these are transformed to parameters appropriate for the distribution being used (e.g., the shape parameter for a gamma, the SD parameter for a log-normal). Easy-peasy.

But now, in addition, I want to soft-constrain the total coefficient of variation (CV_{total}), as even with very reasonable priors on CV_1, CV_2, and CV_3, CV_{total} can very quickly get out of hand (or has tails that are much wider and fatter than reasonable):

CV_{total}=\sqrt{(CV_1^2+1)(CV_2^2+1)(CV_3^2+1)-1}

So, given priors for CV_1, CV_2, and CV_3, how should I implement the addition of information on CV_{total} in Stan? Can I just add target += some_prior_lupdf(CV_total | ..., ...) to my model code with impunity, or do I need to handle Jacobian business as well (and if so, what on earth would that beast look like)?
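For concreteness, here is a stripped-down sketch of the setup (the observation model is omitted, and all priors and hyperparameters below are placeholders, not my actual choices):

```stan
parameters {
  real<lower=0> CV_1;  // individual level
  real<lower=0> CV_2;  // sample level
  real<lower=0> CV_3;  // testing level
}
transformed parameters {
  // pushforward: total CV implied by the three components
  real<lower=0> CV_total
      = sqrt((square(CV_1) + 1) * (square(CV_2) + 1) * (square(CV_3) + 1) - 1);
  // example translations to distribution-specific parameters:
  real k_gamma = 1 / square(CV_1);               // gamma shape, since CV = 1/sqrt(k)
  real sigma_lnorm = sqrt(log1p(square(CV_2)));  // log-normal sigma, since CV = sqrt(exp(sigma^2) - 1)
}
model {
  CV_1 ~ lognormal(0, 1);  // placeholder marginal priors
  CV_2 ~ lognormal(0, 1);
  CV_3 ~ lognormal(0, 1);
  // the soft constraint in question (gamma as a stand-in for some_prior):
  target += gamma_lupdf(CV_total | 2, 2);
}
```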

This thread (Do I need Jacobian adjustment here?) seems relevant. It would be pleasing if you could specify priors for CV_{1:3} that also satisfy the information you have about CV_{\text{total}}, but I think you’d have to specify an appropriate correlation structure between CV_{1}, CV_{2}, and CV_{3}, which I personally find impossible to do.

Thanks! I remember reading that thread a while ago, but could not find it this time. Pushforward distribution was the term I should have looked for.

I did consider a multivariate prior + correlation structure for CV_1, CV_2, and CV_3. And I agree that it is highly impractical and not really feasible.

Some alternative ideas:

  1. Specify priors for CV_{total} and only two elements of CV_{1:3}, and derive the third. Drawback: this will likely lead to inadmissible values (e.g. < 0) for the derived component.
  2. Would it be possible to define some sort of “trick” based on a prior on CV_{total} and a simplex parameter (of length 3), or a softmax-like approach, to derive CV_{1:3}? My gut feeling is that it should be possible to put a prior on an auxiliary parameter \theta (with support \geq 1) such that CV_{total} = \sqrt{\theta - 1}, and then use some sort of stick-breaking / simplex / softmax-like trick to decompose \theta into CV_{1:3}. @betanalpha, care to weigh in? Am I really banging my head against the same stone as described in the post linked by @hhau?

Section 3 of [1406.5881] Constructions for a bivariate beta distribution derives a trivariate beta-like distribution, which is what I think you’d want for the second trick there (and manually tweaking the eight \alpha values might be easier than finding an appropriate correlation structure).

One hacky way might be to sample from p(CV_{1})\,p(CV_{2})\,p(CV_{3})\,p(CV_{\text{total}}) using MCMC, then compute the empirical covariance between the components and use it as a starting point for a multivariate prior p(CV_{1}, CV_{2}, CV_{3})? No idea if this will actually work, but Stan makes the sampling easy enough.
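Roughly along these lines, with stand-in priors everywhere; the empirical covariance of the CV draws could then be computed in R afterwards with cov():

```stan
// Sketch: sample the "overlapping" priors jointly (no data), then take the
// empirical covariance of the CV draws outside of Stan. All distributions
// and hyperparameters are stand-ins.
parameters {
  vector<lower=0>[3] CV;
}
transformed parameters {
  real<lower=0> CV_total = sqrt(prod(square(CV) + 1) - 1);
}
model {
  CV ~ lognormal(0, 1);    // stand-ins for p(CV_1) p(CV_2) p(CV_3)
  CV_total ~ gamma(2, 2);  // stand-in for p(CV_total); the Jacobian question
                           // from upthread applies to this line too
}
```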

Hm, I’m loath to work with so many correlation parameters. I have a different train of thought: let’s stick with that idea of defining an auxiliary parameter \theta=CV_{total}^2+1, so we can define a prior for CV_{total} and work with \theta to derive the individual CV_i. Let’s specify a Dirichlet prior on a simplex parameter \kappa_{1:3} of length 3 (way fewer hyperparameters: 3 instead of 8), and then let:

CV_{total}=\sqrt{(CV_1^2+1)(CV_2^2+1)(CV_3^2+1)-1}

CV_{total}^2+1=(CV^2_1+1)(CV^2_2+1)(CV^2_3+1) = \theta

Now let \theta = \prod_i \theta^{\kappa_i}, where \theta^{\kappa_i} = CV_i^2+1.

Because \sum_i \kappa_i = 1, it follows that \prod_i \theta^{\kappa_i} = \theta^{\sum_i \kappa_i} = \theta^1 = \theta. And because \kappa_i \geq 0 and \theta \geq 1 by definition, it follows that \theta^{\kappa_i} = CV_i^2+1 \geq 1, which is exactly the condition we need to be able to safely derive CV_i=\sqrt{\theta^{\kappa_i}-1}=\sqrt{(CV_{total}^2+1)^{\kappa_i}-1}.
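In Stan, I imagine the whole construction would look roughly like this (the lognormal prior on CV_{total} and the uniform Dirichlet are just placeholders):

```stan
// Sketch: a prior on CV_total plus a simplex kappa that allocates
// theta = CV_total^2 + 1 multiplicatively over the three levels.
parameters {
  real<lower=0> CV_total;
  simplex[3] kappa;
}
transformed parameters {
  real<lower=1> theta = square(CV_total) + 1;
  vector<lower=0>[3] CV;  // CV_i = sqrt(theta^kappa_i - 1)
  for (i in 1:3)
    CV[i] = sqrt(pow(theta, kappa[i]) - 1);
}
model {
  CV_total ~ lognormal(0, 1);           // placeholder prior on the total CV
  kappa ~ dirichlet(rep_vector(1, 3));  // uniform over the simplex
}
```

Each CV[i] can then be translated to the distribution-specific parameters as before.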

I think this may be it!! This is a way (maybe the most straightforward one?!) to soft-constrain the product of a bunch of parameters. I’d rather work with those few hyperparameters of the Dirichlet distribution to get the desired pushforward distribution for each CV_i than deal with N(N-1)/2 correlation parameters to tweak the pushforward distribution for the product of N parameters. Am I missing something?


Seems very reasonable – how easy is it to translate the information you have about CV_{\text{total}} to CV_{\text{total}}^2 + 1?

Trivial, I would say. I could even just define a parameter and prior for CV_{total} and then define \theta=CV_{total}^2 +1 in the transformed parameters block or even locally in the model block.

Thanks for sparring! Often, somewhat less attractive alternative ideas are just the push you need to come up with a nice solution. So, cheers!


EDIT: had to debug the code (was almost there); it’s working now. See the attached R script, which provides a sense check.

Decomposing the overall CV into individual hierarchical components works. And each CV component can be translated to e.g. a shape parameter (k) of a gamma distribution or the sigma parameter of a log-normal distribution.

Specifying a uniform prior on the simplex \kappa seems reasonable as well (note: I implemented the Dirichlet prior in R via independent gamma distributions; in Stan I’m just using the dirichlet distribution directly).

explore_CV_prior.R (3.8 KB)

I think this is the reason for so many prior modeling grievances, but it’s an unavoidable part of prior modeling! The grievance only comes from a false expectation that it should be easier. :-)

Specifying a prior model only through marginal behavior is not a unique default; it’s just one very simple assumption about the possible multivariate behavior! Because it proves to be sufficient in many cases, it’s often taught as the way to build prior models, which then drastically restricts the domain expertise that one can actually encode in a prior model.

Domain expertise that manifests in particular behavior of a prior pushforward distribution, as well as in the marginals, almost always implies that some nontrivial multivariate structure is required. The challenge, as always, is identifying what the right structure is.

The coefficient of variation can be tricky to work with because of the mean normalization, but there are lots of interesting prior models for scales/variances. Distributing some total scale or variance (often variance, because independent variances naturally add) to each hierarchical contribution through a simplex is a pretty common approach, which suggests that the method considered here is reasonable.
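For reference, a minimal sketch of that additive allocation, with placeholder priors:

```stan
// Sketch: distribute a total variance tau^2 over the hierarchical levels
// with a simplex, so that the component variances add back up to tau^2.
parameters {
  real<lower=0> tau;  // total scale; tau^2 is the total variance
  simplex[3] prop;    // proportion of the total variance at each level
}
transformed parameters {
  vector<lower=0>[3] sigma = tau * sqrt(prop);  // sum(square(sigma)) == square(tau)
}
model {
  tau ~ normal(0, 1);  // half-normal via the lower bound; placeholder
  prop ~ dirichlet(rep_vector(1, 3));
}
```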

That said, keep in mind the procedure here. The available domain expertise informs the marginal behavior of the parameters and the behavior of their total. Fixing the total turns the problem around into one of distribution, which naturally leads to simplices and Dirichlet models. You’ve found a reasonable multivariate structure that wasn’t too impractical or infeasible after all!
