Reparameterisation to remove "banana" from posterior distribution

I have a relatively simple model with a multiplicative term leading to a non-identifiable “banana” shape in the posterior. In short, I have a global parameter 0\leq\alpha\leq1 and group-level parameters 0\leq\eta_i\leq1 for each of k groups. Then I generate n samples x_{ij} from a normal distribution for each group i with mean \mu_i=\alpha\eta_i and known variance \sigma^2. The Stan model is

data {
    int<lower=1> n, k;
    real<lower=0> sigma;
    matrix[k, n] x;
}

parameters {
    real<lower=0, upper=1> alpha;
    vector<lower=0, upper=1>[k] eta;
}

transformed parameters {
    vector<lower=0, upper=1>[k] mu = alpha * eta;
}

model {
    alpha ~ beta(1, 1);
    eta ~ beta(3, 3);
    for (i in 1:k) {
        x[i] ~ normal(mu[i], sigma);
    }
}

Sampling from the posterior distribution gives a “banana” in \alpha-\eta_i space as one might expect (see left panel of figure below), including some divergences because of the nasty shape. Sampling in the \alpha-\mu_i space (see right panel) would be much nicer because the parameters \mu_i are pinned down by the data x.

[Figure: posterior samples in \eta_i-\alpha space (left) and \mu_i-\alpha space (right)]

Suppose we parametrise the model in terms of \alpha and \mu_i instead. Then we can readily obtain \eta_i=\mu_i/\alpha. However, for \eta_i to lie in the unit interval we require 0\leq\mu_i\leq\alpha, which doesn’t seem like a trivial constraint to impose. We could of course change variables to a quantity on the unit interval and multiply by \alpha to satisfy the constraint, but that is exactly the variable \eta_i we’re trying to get rid of.

Any ideas on how to reparameterise this model would be much appreciated. A reproducible example (using pystan) can be found in this gist.

An aside: reparameterising in terms of \eta=1/\xi for 1\leq \xi is slightly nicer because we end up with a linear correlation between \xi and \alpha (rather than a banana), but it’s still not great.

[Figure: posterior samples in \xi-\alpha space]
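For reference, a minimal sketch of this aside’s parametrisation, keeping the original beta priors on \eta = 1/\xi via the corresponding Jacobian adjustment (the priors and structure are copied from the model above; the rest is illustrative):

data {
    int<lower=1> n, k;
    real<lower=0> sigma;
    matrix[k, n] x;
}

parameters {
    real<lower=0, upper=1> alpha;
    vector<lower=1>[k] xi;
}

transformed parameters {
    vector<lower=0, upper=1>[k] eta = inv(xi);
    vector<lower=0, upper=1>[k] mu = alpha * eta;
}

model {
    alpha ~ beta(1, 1);
    eta ~ beta(3, 3);
    for (i in 1:k) {
        x[i] ~ normal(mu[i], sigma);
    }

    // Jacobian for eta = 1 / xi: |d eta / d xi| = xi^-2 elementwise.
    target += -2 * sum(log(xi));
}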


So I guess you realize that this model uses k+1 free parameters to define k \mu's, and so there are always going to be identifiability issues here unless you resort to strong priors. With that said, and not knowing your exact use case, let’s see if we can figure out how to reparameterize to make this space easier to sample from.

I think the trick will be to switch from \mu = \alpha\eta to log(\mu)=log(\alpha)+log(\eta). Now we need to sample under the constraint that log(\alpha) and log(\eta) are both negative. I haven’t checked myself, but I wonder if simply parameterizing in terms of upper-bounded log(\alpha) and log(\eta) is already helpful. There should still be strong posterior correlations, but they might not be banana-shaped, and could conceivably be amenable to sampling using a dense metric.

Edit to add:
I think the important practical question here is whether your goal is actually to do inference on \alpha and \eta using principled priors, or whether your priors are chosen somewhat more arbitrarily. There’s no information in the data to distinguish these quantities, so either you are putting a lot of stock in your prior model, or you simply cannot do inference on these particular quantities. If the goal is to use principled priors, then one route to taming the banana (only if defensible on prior grounds) would simply be to strengthen the priors.
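For concreteness, a minimal sketch of the strengthened-prior route; only the model block changes relative to the original program, and the beta(20, 20) below is a purely hypothetical choice that would have to be justified on prior grounds:

model {
    // Informative prior on alpha (hypothetical strength); eta and the
    // likelihood are unchanged from the original model.
    alpha ~ beta(20, 20);
    eta ~ beta(3, 3);
    for (i in 1:k) {
        x[i] ~ normal(mu[i], sigma);
    }
}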


Yes, agreed that there’s always going to be some problem with identifiability here. I was hoping to make progress akin to the non-centred parametrisation of Neal’s funnel, where the scale parameter is only weakly identified (leading to awkward shapes in the posterior distribution). The change of variables makes the sampling easier, but the marginal posterior for the scale parameter remains broad.
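For reference, the non-centred parametrisation of Neal’s funnel alluded to here looks roughly like the following (a standard sketch, not code from the gist):

parameters {
    real y;            // weakly identified log-scale parameter of the funnel
    vector[9] z;       // standardised latent variables
}

transformed parameters {
    // Non-centred: the latent variables are reconstructed deterministically,
    // so the sampler explores a much more isotropic space.
    vector[9] theta = exp(y / 2) * z;
}

model {
    y ~ normal(0, 3);
    z ~ std_normal();
}

The distribution of y is unchanged by the reparametrisation; only the geometry the sampler sees improves, which is the kind of gain hoped for here.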

Thank you for the suggestion; it definitely looks nicer in the constrained space (code below; gist updated), but the sampler struggles. Gut feeling: the unconstrained space isn’t particularly nice because we have two exponential transforms to get from the unconstrained space to alpha/eta: one implicit in imposing upper=0 on log_alpha and log_eta, and one explicit to get from log_alpha to alpha and log_eta to eta. I had to increase the target acceptance probability to 0.99 to get the sampler to behave.

[Figure: posterior samples in log_eta-log_alpha space under the reparameterised model]

data {
    int<lower=1> n, k;
    real<lower=0> sigma;
    matrix[k, n] x;
}

parameters {
    real<upper=0> log_alpha;
    vector<upper=0>[k] log_eta;
}

transformed parameters {
    real<lower=0, upper=1> alpha = exp(log_alpha);
    vector<lower=0, upper=1>[k] eta = exp(log_eta);
    vector<lower=0, upper=1>[k] mu = alpha * eta;
}

model {
    alpha ~ beta(3, 3);
    eta ~ beta(1, 1);
    for (i in 1:k) {
        x[i] ~ normal(mu[i], sigma);
    }

    // Account for the change of variables using the log jacobian.
    target += sum(log_eta) + log_alpha;
}

There’s not always a viable reparameterization, especially when constraints are involved, but there are other potential resolutions. See for example the discussion in Section 5.3 of Identity Crisis.

All of that said, a more effective approach might be to consider a more appropriate model for the assumed constraints, for example the location-dispersion parameterization of the beta family: https://betanalpha.github.io/assets/case_studies/probability_densities.html#242_the_location-dispersion_parameterization.

Thank you for the pointers, @betanalpha. Section 5 of your identity crisis post was indeed what motivated the search for a parametrisation (e.g. based on \mu and \alpha rather than \eta and \alpha) in which each parameter is more directly identified by the data, attenuating dependencies in the posterior. But, as you mention, the constraints make an approach akin to the reparametrisation in section 5.2.1 difficult.

Understood that the location-dispersion parametrisation of the beta distribution (and similarly for the gamma distribution) is more easily identified by data on the unit interval (positive reals, respectively). Unfortunately, I couldn’t quite see how this insight helps tackle the problem above. Any additional insights would be much appreciated.


@tillahoffmann.

  1. As everyone else is saying, this model is fundamentally overparameterized. IRT models solve their multiplicative non-identifiability by constraining the scale of one of the variables. For instance, if you take
a ~ normal(0, 1);
b ~ normal(0, sigma);
c = a * b;

Then the scale of c is determined by sigma.

But I don’t see how to do that with (0, 1)-constrained parameters.

Also, you can look at it as an additive non-identifiability of log_alpha and log_eta. So if you want to work with priors directly on those, then you can center one around zero and let the other drift. It’ll just convert the strong non-identifiability into a weaker form.
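A minimal sketch of that idea, assuming priors are placed directly on the log-scale parameters (so no Jacobian is needed); the specific prior scales are hypothetical, not recommendations:

data {
    int<lower=1> n, k;
    real<lower=0> sigma;
    matrix[k, n] x;
}

parameters {
    real<upper=0> log_alpha;
    vector<upper=0>[k] log_eta;
}

model {
    // Pin log_alpha near zero and let log_eta absorb the drift;
    // the prior scales 0.1 and 1 are illustrative choices only.
    log_alpha ~ normal(0, 0.1);
    log_eta ~ normal(-1, 1);
    for (i in 1:k) {
        x[i] ~ normal(exp(log_alpha + log_eta[i]), sigma);
    }
}

This doesn’t remove the additive degeneracy between log_alpha and log_eta, but it converts the strong non-identifiability into a weaker form, as described above.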

  2. Stan lets you avoid all the transforms and Jacobians by constraining the parameters (though it’s a slightly different transform than what you are doing, the result is the same distribution on alpha and eta).
parameters {
  real<lower = 0, upper = 1> alpha;
  vector<lower = 0, upper = 1>[k] eta;
}
model {
  alpha ~ beta(3, 3);
  eta ~ beta(1, 1); // this is a no-op
  for (i in 1:k) {
    x[i] ~ normal(alpha * eta[i], sigma);
  }
}

Thank you for the additional insights, @Bob_Carpenter.

Understood that this model is over-parameterised (and the posterior is thus not great for sampling). I should’ve been more specific: I’m interested in approaches to sample this distribution more efficiently rather than making the individual parameters identifiable (which, as you point out, could be achieved by imposing tighter priors on \alpha, for example).

I’ve been thinking about this problem as analogous to the non-centred parametrisation of Neal’s funnel: the posterior in the original space is still nasty, and the reparametrisation only helps us sample it more efficiently (rather than ramping up adapt_delta very close to 1 in the original parametrisation). But maybe that’s just tricky given the constraints.

A common source of confusion in hierarchical modeling is that the centered and non-centered parameterizations apply only to location-scale population models, which are themselves limited to unconstrained spaces. Consequently the strategies for implementing hierarchical models on these unconstrained spaces will not carry over to hierarchical models on constrained spaces. Some families of probability density functions over constrained spaces are also closed under certain transformations, but they are in general much more complicated than the centered and non-centered parameterizations.

Let me explain what I mean by a more appropriate model. Your constraints limit the mean to the unit interval, while the normal variation allows observed values to leak outside of the unit interval. I’m going to assume that the leaking is incidental; in other words that the observed values of x are also constrained to the unit interval.

One natural model for unit-interval-constrained observations is the beta family, in particular the beta family with a location-dispersion parameterization. Here we model a location parameter \mu \in [0, 1] that defines the mean and a dispersion parameter \phi \in [0, 1] that determines how concentrated the densities are around that mean.
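For readers unfamiliar with this reparameterization, one common mean-based form of the beta density uses a concentration \kappa > 0 (closely related to, but not identical with, the dispersion \phi in the linked case study):

\begin{align*}
x &\sim \text{beta}(a, b), \qquad a = \mu \kappa, \quad b = (1 - \mu) \kappa,
\end{align*}

so that \mathbb{E}[x] = \mu and larger values of \kappa concentrate the density more tightly around \mu.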

If the locations are heterogeneous then we might want to model them hierarchically, but hierarchical population models for constrained spaces are subtle. Another option is to model the heterogeneity in an unconstrained transformation of the locations rather than the locations directly. For example the logit function maps the (0, 1) interval to (-\infty, \infty) so that we can model

\lambda_{k} = \text{logit}(\mu_{k})

with a normal hierarchical model and apply all of the parameterization tricks to optimize the posterior geometry.

The final model can then be written as

\begin{align*}
\lambda_{k} &\sim \text{normal}(\omega, \tau) \\
\mu_{k} &= \text{logistic}(\lambda_{k}) \\
x_{n} &\sim \text{beta}(\mu_{k(n)}, \phi),
\end{align*}

where k(n) denotes the group to which observation n belongs and the last line uses the location-dispersion parameterization of the beta family, with principled priors over everything as well.
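A minimal Stan sketch of this formulation, assuming the observations lie strictly inside the unit interval; it uses a non-centred parameterization for the \lambda_{k} and Stan’s beta_proportion distribution, which takes a mean and a concentration kappa rather than the dispersion \phi above, and all priors are placeholders:

data {
    int<lower=1> n, k;
    matrix<lower=0, upper=1>[k, n] x;  // assumed to lie strictly inside (0, 1)
}

parameters {
    real omega;                // population location of the logit-scale means
    real<lower=0> tau;         // population scale
    vector[k] lambda_raw;      // non-centred group-level parameters
    real<lower=0> kappa;       // concentration of the beta observation model
}

transformed parameters {
    vector[k] lambda = omega + tau * lambda_raw;
    vector<lower=0, upper=1>[k] mu = inv_logit(lambda);
}

model {
    // Placeholder priors; substitute principled choices.
    omega ~ normal(0, 1);
    tau ~ normal(0, 1);
    lambda_raw ~ std_normal();
    kappa ~ normal(0, 10);
    for (i in 1:k) {
        x[i] ~ beta_proportion(mu[i], kappa);
    }
}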

Of course if the original multiplicative form is derived from some physical process, and not just a heuristic model of heterogeneity, then the above approach wouldn’t be appropriate and you’d have to tackle the multiplicative degeneracy more directly. The problem specification just smells a bit off to me, hence the suggestion for an alternative formulation.