Parametrization for 6 non-independent variables in (0,1) and (0,∞)

I have a set of individual-specific parameters in a hierarchical model, specifically \alpha_{i,1}, \alpha_{i, 2}, \beta_{i,1}, \beta_{i,2} \in [0,1], x_{i,1}, x_{i, 2} > 0. These are not observed directly, but play a role in the likelihood of individual data.

I am struggling to find a good hierarchical prior for this, for the following reason: while the \alpha's and x's are interior, \beta's are interior for some individuals while they are more consistent with \beta \approx 0 for some individuals. All four variations occur in the data: \beta_{i,1} \approx 0 while \beta_{i,2} interior, etc.

The way I am parametrizing it now is taking a multivariate normal distribution, getting \alpha, \beta's from a logit transformation while x's from a log (excerpt from stan code follows, the whole model is a 500-line monster that no one would read). You can see why this is problematic: it cannot deal with mass around 0 in \beta's.

This is giving me a lot of convergence issues, ie maxed out treedepth and divergences. Suggestions would be appreciated: how would you model this?

Stan code excerpt
functions {
  /* logit(x) transformation and log Jacobian correction for parameters.
  real logit_lp(real x) {
    target += -log(x) - log1m(x);
    return logit(x);

  /* Cross sectional distribution of individual-specific parameters.
     Transformed parameters ~ multi_normal_cholesky(cs_mean, cs_L)
     xs > 0, 0 < alphas, betas < 1.
  void cross_sectional_distribution_lp(vector xs, real alpha1, real alpha2, real beta1, real beta2,
                                       vector cross_section_mean,
                                       matrix cross_section_L) {
    vector[6] z;
    z[1] = logit_lp(alpha1);
    z[2] = logit_lp(beta1);
    z[3] = log(xs[1]);
    z[4] = logit_lp(alpha2);
    z[5] = logit_lp(beta2);
    z[6] = log(xs[2]);
    target += -z[3] - z[6]; // log Jacobian adjustment for y = log(x)
    z ~ multi_normal_cholesky(cross_section_mean, cross_section_L);
parameters {
  /* cross sectional distributions
  vector[6] cross_section_intercept;    // intercept
  vector<lower=0>[6] cross_section_std; // standard deviation
  cholesky_factor_corr[6] cross_section_corr_factor; // correlations (as Cholesky decomposition)
model {
  matrix[6, 6] cross_section_L; // Cholesky factor of cross-sectional variance
  cross_section_intercept ~ normal(0, 5);
  cross_section_std ~ student_t(7, 0, 5);
  cross_section_corr_factor ~ lkj_corr_cholesky(2.0);
  /* initialize common variables
  cross_section_L = diag_pre_multiply(cross_section_std,
  for (i in 1:N) {
    cross_sectional_distribution_lp(xs[j], alpha1[j], alpha2[j], beta1[j], beta2[j],
                                    cross_section_intercept, cross_section_L);

Corresponding KDE (multimodes could be convergence issues, or maybe that’s how the data looks):

If the values really are constrained to be positive, but can also be zero, then you run into trouble with the usual parameterizations because it pushes mass out to -infinity on the unconstrained scale. If you don’t want zeros, you can introduce a zero-avoiding prior, but these usually have to be pretty strong. And they’re hard to mix in with the multi-normal.

[Edit: I didn’t understand why the cross_sectional_distribution_lp code adjusts for log but not for logit. Oops, saw the adjustment in the other function]. I think it would be much easier to work the other way around. Define the distribution on unconstrained parameters then constrain in the code. That way you don’t need any Jacobians (which really get doubled because we implement them for the original constraints).

The other natural approach would be zero inflation. The marginalizations will be tractable because they’re individual level. You might also try more weakly informative priors for cross_section_sd unless you really expect large values there.

The code would be a lot faster if you pushed the loop into cross_sectional_distribution_lp and vectorized the multi-normal Cholesky.

1 Like

Thanks, I went with this and got rid of most of the divergences and maxed out treedepth.

The model is still not mixing super-well but that may be a separate issue, which I will investigate with gradually building up the same model using simulated data.

For this model that is not a good fit, but I will keep this in mind for future models.