Are there any general guidelines to aid model reparametrization?

Hi, all.

Are there any general guidelines or tips on how to approach model reparametrization? For example, McElreath touches upon centered and non-centered parametrizations of the normal distribution in chapter 13 of the 2nd edition of Statistical Rethinking, and mentions reparametrizing the exponential distribution. I’ve also found several other resources discussing reparametrization, but they all seem to deal with a specific distribution (usually the Gaussian) rather than the general case, and I’ve been unable to find a more comprehensive treatment of the topic.

So, are there any general approaches to reparametrizing models that would be (at least somewhat) understandable to someone without a degree in mathematics?

1 Like

The User’s Guide is probably one of the best sources I think.

1 Like

Hi!

Thanks for the suggestion. While it is an informative source, I find it to be more technical and specific than what I am looking for.

Reparameterization is most often motivated by computational considerations rather than mathematical ones. Because the computational details depend on the sampling algorithm, warmup adaptation, numerical precision, etc., there typically isn’t a gold-standard parameterization that will work for all problems. On the forums here, the implicit assumption when talking about reparameterization is that it’s for use with HMC and, more generally, gradient-based sampling.

@Bob_Carpenter has been working on new samplers and can speak a bit more about how he’s designing gradient based samplers that alleviate some of the burden on the user to reparameterize. This is by having clever ways to explore tough posterior geometries.

And posterior geometry is really what this is about. If you can make your parameters uncorrelated and Gaussian-shaped (not thick-tailed), derivative-based sampling with finite-precision computers will happily explore. Some nasty posterior geometry is structural; this is where reparameterization helps, if you can do it. Other times it is an identification problem, where more data or constraints can pin down the parameters, removing mode switching or equally valid models obtained by swapping one or more parameters.

I don’t know as much about non-gradient sampling. I know such samplers can be less general, but they can sample certain models that HMC and other gradient-based samplers cannot. I don’t know how reparameterization helps there, but that’s an interesting research area if anyone wants to report back.

In a hierarchical-models course I gave, I talked about “orthogonalizing” parameters of exponential-family distributions as one useful reparameterization, meaning to parametrize so that the expected Fisher information matrix is diagonal. Sometimes you can achieve this even before taking the expectation, which is nice, but diagonal-in-expectation is still often better than strongly correlated parameters.
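As a concrete (textbook, not from the course above) instance of a diagonal expected Fisher information: for a normal likelihood parametrized by location and scale (\mu, \sigma), the cross term vanishes in expectation, so the two parameters are orthogonal in this sense:

```latex
% Expected Fisher information for y ~ Normal(mu, sigma),
% parametrized by (mu, sigma). The off-diagonal entry
% E[-d^2 log p / (d mu d sigma)] is zero, so the matrix is diagonal.
\mathcal{I}(\mu, \sigma)
  = \begin{pmatrix}
      1/\sigma^2 & 0 \\[2pt]
      0          & 2/\sigma^2
    \end{pmatrix}
```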

5 Likes

Hi.

Thanks for weighing in.

If you can make your parameters uncorrelated and Gaussian-shaped (not thick-tailed), derivative-based sampling with finite-precision computers will happily explore.

So, in general, we’re “simply” looking for a way to decouple the parameters from each other? E.g. McElreath (Stat. Rethinking, 2020, p. 424) has an example where

\alpha_j \sim N(\bar{\alpha}, \sigma_{\alpha}) \\ \bar{\alpha} \sim N(0, 1.5) \\ \sigma_{\alpha} \sim \mathrm{Exp}(1)

is reparametrized as

\alpha_j = \bar{\alpha} + z_j \times \sigma_{\alpha} \\ \bar{\alpha} \sim N(0, 1.5) \\ \sigma_{\alpha} \sim \mathrm{Exp}(1) \\ z_j \sim N(0, 1).

So, in the first parametrization, HMC has a hard time sampling \alpha_j because it is tightly coupled with \bar{\alpha} and \sigma_{\alpha}? In other words, we’ve in effect created a correlation between the distributions of \bar{\alpha} and \sigma_{\alpha}, and the resulting multivariate distribution (that of \alpha_j) has a weird shape that’s difficult to sample?

In the second parametrization, on the other hand, \alpha_j has been broken into the independent distributions of \bar{\alpha}, z and \sigma_{\alpha}, which are all easier to sample?
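If it helps, here’s a quick numerical sanity check in Python (NumPy; the fixed hyperparameter values are mine, purely for illustration) that the centered and non-centered forms describe the same distribution for \alpha_j:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Fixed hyperparameter values for the check (illustrative only).
alpha_bar, sigma_alpha = 1.0, 2.0

# Centered: alpha_j ~ Normal(alpha_bar, sigma_alpha).
alpha_centered = rng.normal(alpha_bar, sigma_alpha, size=n)

# Non-centered: alpha_j = alpha_bar + z_j * sigma_alpha, z_j ~ Normal(0, 1).
z = rng.normal(0.0, 1.0, size=n)
alpha_noncentered = alpha_bar + z * sigma_alpha

# Both should have mean ~1.0 and standard deviation ~2.0.
print(alpha_centered.mean(), alpha_centered.std())
print(alpha_noncentered.mean(), alpha_noncentered.std())
```

The distributions match; what differs is which quantities the sampler works with, and hence the posterior geometry it sees.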

I have completely accidentally stumbled upon a paper in my Zotero library (Papaspiliopoulos, O., Roberts, G. O., & Sköld, M. (2007). A General Framework for the Parametrization of Hierarchical Models. Statistical Science, 22(1), 59–73) which has the figures below illustrating centered and non-centered parametrizations. Seems to be in line with what you wrote, if I understood both correctly.

They also list some “tricks” for reparametrizations:

1 Like

Unfortunately, it is a bit more complicated. It’s not simply that centered is worse than non-centered; it depends on how much data you have to inform each group. A good treatment of this is @betanalpha’s case study Hierarchical Modeling.

He goes through three cases: centered, non-centered, and mix-centered. However, I derived a fourth, with a continuous mixing weight, that I call partially centered. It interpolates between high- and low-data regimes, but you must supply the mixing weight yourself because it’s unidentified within the Stan model (so you decide what counts as a little or a lot of data for each group).


That’s from my slides at GitHub - spinkney/hierarchical_model_tutorial: Hierarchical Models Tutorial StanCon 2024.

3 Likes

Thanks for the additional info. I didn’t mean to imply that either one is worse per se. But reading back, I do realize that it may have come off that way.

As for Michael Betancourt’s treatment of the topic - I have read that case study some time ago, and while I greatly appreciate his work (the depth, comprehensiveness, and the fact that he shares such incredible resources for free) I do sometimes find it to be overwhelming (again, coming from someone without a mathematical background). Additionally, as I’ve mentioned earlier, this is also dealing with reparametrizing a normal distribution, and I was wondering whether there was a general approach to (non-)centered parametrizations.

1 Like

Just for clarity: parameters can be “uncoupled” in the prior distribution (e.g. the non-centered parametrization decouples the priors for the parameters) but still be “coupled”/“correlated”/… in the posterior for a specific dataset (and vice versa). For HMC, the shape of the posterior is what matters (the closer to independent normals, the better). Often, uncoupling in the prior leads to nicer posteriors (especially in cases where you don’t have a ton of data), but that’s just a useful heuristic, not a fundamental property of reparametrization.

Unless you are willing to do the hardcore math thing, I think reparametrization is best explored through examples - we list some at How to Diagnose and Resolve Convergence Problems – Stan, specifically (beyond non-centered parametrization):

  1. Non-centered parametrization for the exponential distribution
  2. Stan users guide chapter on QR reparametrization for linear models
  3. Identifying non-identifiability - a sigmoid model shows an example of where the parameters are not well informed by data, while Difficulties with logistic population growth model - #3 by martinmodrak shows a potential reparametrization.
  4. Reparametrizing the Sigmoid Model of Gene Regulation shows problems and solutions in an ODE model.
  5. Multiple parametrizations of a sum-to-zero constraint.

Hope some of those are helpful

Unfortunately, my experience is that developing useful reparametrizations is quite hard and typically requires substantial mathematical insight into the model-data combination at hand. Reparametrizing more “empirically” (e.g. looking at a pairs plot and trying to guess changes that would decorrelate the pairs) has almost never worked for me.

UPDATE:

Just remembered another nice example of reparametrization: previously, Stan parametrized the simplex with a stick-breaking transform (see Constraint Transforms in version 2.36), but now it uses the ILR (inverse softmax) transform (see Constraint Transforms in the current version), which should work better. I don’t understand the problem very deeply, but my guess is that the advantage of the ILR is that it is both computationally more efficient and symmetric (insensitive to the order of the unconstrained parameters).

5 Likes

Hi!

Thanks a lot for your input. I’ll try checking out the resources you’ve provided as I progress through my current model.

Also, thank you for this:

Unfortunately, my experience is that developing useful reparametrizations is quite hard and typically requires substantial mathematical insight into the model-data combination at hand.

I’m, theoretically, more than willing to dive into the mathematics, but there’s only a finite amount of time in a day. This question about reparametrizations was really starting to bug me, mostly because I wasn’t sure whether I was missing something obvious (like a really clear paper that explains everything), or whether it just isn’t a simple topic. I feel like some of the expositions on the topic have that “then you just do this and that’s it” quality, while in reality it’s “draw the circle, draw the rest of the owl” for someone not deeply steeped in the topic. So thank you for clearing that up :D

With reparameterization, we’re trying to make the reparameterized density as close to standard normal as possible: no correlation, unit scale, centered, and Gaussian tails.

In theory, this is easy: sample from a standard normal, apply the normal CDF to transform to [0, 1]^N, then apply the inverse CDF of the multivariate posterior, and you’re done. That leads Stan to sample the standard normal but transform to the target density. The only problem (:-)) is that we can’t write down the inverse CDF of the posterior. So we take baby steps that do things like decorrelate and unit-scale the parameters where we can.

Forgetting about funnels, just think about a Cauchy distribution. Its tails are so fat that it doesn’t even have a finite mean or variance, and it does not sample well with HMC. One way to sample it would be to sample from normal(0, 1), then apply the normal CDF (Phi), and then apply the Cauchy inverse CDF, which is analytic (cf. Cauchy distribution - Wikipedia). Then the sampler is working over a standard normal and all the work is done by the inverse CDF. This is a reasonable way to code a Cauchy, but you can also just sample from uniform(0, 1) in Stan and transform—the transform Stan puts on (0, 1) implies that we are really sampling from a standard logistic.
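Here’s a sketch of that inverse-CDF trick in Python (NumPy/SciPy stand in for what the sampler does; the specific sample size is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000

# Sample on the "nice" standard-normal scale.
z = rng.normal(0.0, 1.0, size=n)

# Push through the standard normal CDF to get Uniform(0, 1) draws...
u = stats.norm.cdf(z)

# ...then through the Cauchy inverse CDF (quantile function), which is
# analytic: F^{-1}(u) = tan(pi * (u - 1/2)).
x = stats.cauchy.ppf(u)  # same as np.tan(np.pi * (u - 0.5))

# x is standard Cauchy; check the median and quartiles (the mean does
# not exist, so it's useless as a check).
print(np.median(x), np.percentile(x, 25), np.percentile(x, 75))
```

All the heavy-tailed behavior lives in the deterministic transform, while the sampler only ever sees the well-behaved standard normal.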

That’s why the centered parameterization is good in some cases and not others, and why decorrelating is always part of the goal. What happens with a hierarchical model is that with no data, the posterior looks like a funnel, but with lots of data it looks roughly normal. So you only want to use the non-centered reparameterization when the data plus prior are not very informative about the posterior.

Another non-funnel example you might find useful is the reparameterization of the beta distribution into a mean and total concentration, beta2(y | mu, kappa) = beta(y | mu * kappa, (1 - mu) * kappa). This turns out to work way better for hierarchical models because it decorrelates alpha and beta in beta(alpha, beta). Andrew talks about this in the first hierarchical model example in chapter 5 of Bayesian Data Analysis (free pdf on the book’s home page).
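The mean/concentration parametrization is a one-liner to sketch in Python (NumPy; the particular mu and kappa values are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Mean/concentration parametrization: mu in (0, 1), kappa > 0.
mu, kappa = 0.3, 20.0

# beta2(y | mu, kappa) = beta(y | mu * kappa, (1 - mu) * kappa)
alpha = mu * kappa
beta = (1.0 - mu) * kappa

y = rng.beta(alpha, beta, size=n)

# The mean of Beta(alpha, beta) is alpha / (alpha + beta) = mu, so mu
# is directly interpretable as the mean of the distribution.
print(y.mean())
```

The interpretability is the point: mu and kappa answer separate questions ("where is the mass?" and "how concentrated is it?"), whereas alpha and beta jointly shift both.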

4 Likes

For completeness, there is at least one other important reason to reparameterize, and that is when your desired prior is easier to express in an alternative parameterization.

For example, consider a model that includes two variance components, parameterized by their standard deviations \sigma_1 and \sigma_2. In some scenarios, it might be natural to express your prior belief in terms of the total combined variance, with square root \sigma_{total}, and a parameter \omega on the unit interval that gives the proportion of the total variance belonging to the first component. If so, the path of least resistance might be to parameterize in terms of \sigma_{total} and \omega rather than in terms of \sigma_1 and \sigma_2. This might be true irrespective of the details of the posterior geometry, though in practice it’s often the case that parameterizations that are amenable to expressing your prior beliefs tend to also be those with “nice” posterior geometries.
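A minimal sketch of that mapping in Python, assuming the convention \sigma_1^2 = \omega\,\sigma_{total}^2 and \sigma_2^2 = (1-\omega)\,\sigma_{total}^2 (the post above doesn’t fix a convention; this is one natural choice, and the function name is mine):

```python
import numpy as np

def split_scales(sigma_total: float, omega: float):
    """Map (sigma_total, omega) to component scales (sigma_1, sigma_2),
    where omega is the proportion of total variance in component 1."""
    sigma_1 = sigma_total * np.sqrt(omega)
    sigma_2 = sigma_total * np.sqrt(1.0 - omega)
    return sigma_1, sigma_2

s1, s2 = split_scales(sigma_total=2.0, omega=0.25)
# The component variances recombine to the total:
# s1**2 + s2**2 == sigma_total**2.
print(s1, s2, s1**2 + s2**2)
```

You would then put priors directly on sigma_total (e.g. a half-normal) and omega (e.g. a beta), and derive sigma_1 and sigma_2 in the transformed parameters.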

3 Likes

Hi.

Thanks for weighing in to both of you

@Bob_Carpenter, this sounds interesting:

So if we have a distribution that is “odd” (e.g. deviating a lot from the Gaussian?), sampling from a uniform distribution and then transforming the sampled value might speed up sampling?

Yes!

There’s a discussion of the Cauchy case in the Efficiency Tuning chapter of the User’s Guide:

The inverse cdf has a simple form, so it’s easy to code.

2 Likes