Adaptation and non-centered parameterisations

I had some thoughts about non-centered parameterisations and how they can prevent the sampler from adapting to the posterior geometry. And I suspect there might be a simple fix for that.

Suppose we have 100 measurements in 10 groups and want to estimate the mean. That’s a simple hierarchical model with 10 groups and 10 observations in each group. The parameters are the global mean, the group offsets, and the standard deviation of the group offsets. Let’s say the dependent variable is Poisson-distributed, so we don’t have to worry about estimating the sd of the observation-level error. This makes 12 parameters in total.
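Written out, the model is (with g[i] denoting the group of observation i, and a log link as in the code below):

y_i \sim \textrm{Poisson}\big(\exp(\mu + \delta_{g[i]})\big), \qquad \delta_g \sim \textrm{Normal}(0, \sigma_{\textrm{group}}), \qquad g = 1, \dots, 10,

where \mu is the global mean, the \delta_g are the group offsets and \sigma_{\textrm{group}} is their standard deviation.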

Here’s some R code to create example data:

library(tidyverse)
library(rstan)
library(tidybayes)


set.seed(2)
nGroups=10
nWithinGroup=10
globalMean=10
groupSd=.1

groupOffsets=rnorm(nGroups,0,groupSd)
data=
    crossing(
        groupId=seq_len(nGroups)
        ,withinGroupId=seq_len(nWithinGroup) # one row per observation, nGroups*nWithinGroup rows in total
    ) %>% 
    mutate(
        datapointId=seq_len(n())
        ,linearPredictor=globalMean+groupOffsets[groupId]
        ,y=rpois(n(),exp(linearPredictor))
    )
data %>%
    ggplot()+
    geom_point(
        aes(
            datapointId
            ,y
            )
        )

And the resulting graphic showing the data:

Here’s the code of the Stan model:

data {
  int nGroups;
  int nWithinGroup;
  int groupId[nGroups*nWithinGroup];
  int y[nGroups*nWithinGroup];
}

parameters {
  real globalMean;
  real <lower=0> groupSd;
  vector [nGroups] groupOffsets_offCenter;
}

transformed parameters {
  // variable definitions
  real linearPredictor[nGroups*nWithinGroup];
  vector [nGroups] groupOffsets;

  // off-center parameterisation
  groupOffsets=groupOffsets_offCenter*groupSd;
  
  // linear predictor
  for(currDatapointId in 1:(nGroups*nWithinGroup))
  {
    linearPredictor[currDatapointId]=
      globalMean+
      groupOffsets[groupId[currDatapointId]];
  }
  
}

model {
  
  // high level priors
  // use implicit improper uniform priors
  
  // latent terms
  groupOffsets_offCenter~normal(0,1);
  
  // likelihood
  y~poisson(exp(linearPredictor));
  
}

This code fits the model and makes some graphics of the results:

# fit model
post=
    stan(
        "stanmodel1.stan"
        ,data=
            list(
                nGroups=nGroups
                ,nWithinGroup=nWithinGroup
                ,groupId=data$groupId
                ,y=data$y
            )
    )

# extract some results
postExtractions=
    post %>% 
    spread_draws(
        globalMean
        ,`groupOffsets_offCenter[1]`
        ,`groupOffsets[1]`
        ,groupSd
        ,divergent__
        ,treedepth__
        ,lp__
    )

# plot
par(bg="#303030")
postExtractions %>% 
    mutate( # add some spread to the discrete variables to reduce overdraw in the plot
        .chain=.chain+runif(n(),-.3,.3)
        ,divergent__=divergent__+runif(n(),-.3,.3)
        ,treedepth__=treedepth__+runif(n(),-.3,.3)
    ) %>%
    # select( # activate this section to plot only the most interesting variables
    #     globalMean
    #     ,`groupOffsets_offCenter[1]`
    #     ,`groupOffsets[1]`
    #     ,groupSd
    #     ) %>%
    {pairs(
        .
        ,col=hsv({ # colourize the points according to the estimated standard deviation of the group offsets
            a=(.$groupSd)
            a=a-min(a)
            a=a/max(a)
            a=a/3
        })
        ,pch=46
        ,upper.panel=NULL
    )}
par(bg="white")

Here’s the resulting full model diagnostic plot (just for the sake of completeness; yes, I know there are 4 divergences):

And here’s a closer look at the most interesting variables:

The samples are colored according to the estimated standard deviation of the group offsets.
Only the first of the group offsets is shown, both before (groupOffsets_offCenter[1]) and after (groupOffsets[1]) the non-centering transform.

The plot shows that in this model there is a strong linear correlation between the estimated mean and the group offsets (the panel on the left in the middle row). This should not be a problem for Stan, as it can adapt the mass matrix to take that into account. However, since I have introduced a non-centered parameterisation for the group offsets, this is not actually the posterior shape that Stan sees internally; what the sampler has to deal with is instead the shape shown in the panel in the upper left. The coloring of the samples shows that the correlation between the mean and “groupOffsets_offCenter[1]” depends on the estimate of “groupSd”. As far as I’m aware, Stan cannot adapt to such a situation, where the correlation between two parameters changes depending on a third parameter.
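One way to see where that dependence comes from: the Poisson counts pin down each group’s log-mean very precisely, so the posterior concentrates around globalMean + groupSd * groupOffsets_offCenter[g] being nearly constant for every group g. Along that ridge,

\Delta\,\textrm{globalMean} \approx -\,\textrm{groupSd} \cdot \Delta\,\textrm{groupOffsets\_offCenter}[g],

i.e. the slope of the ridge in the upper-left panel is roughly -groupSd, and therefore changes with the sampled value of groupSd.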

(I suspect this is why this particular example model samples badly and throws divergences. And indeed, the model can be “repaired” by reparameterizing “globalMean” such that it is no longer correlated with the group offsets.)

I think that there are many models out there where variables are simply linearly correlated, and where the non-centering transform changes this into a type of correlation that Stan can’t adapt to. Conversely, I suspect that the number of cases where the non-centering changes the correlation structure into something more favorable for the adaptation is much smaller in practice. Granted, I have selected this example dataset specifically to pronounce the issue, but I still think it’s a common problem in practice.

This led me to the core idea of this post. Since Stan nowadays has dedicated syntax for declaring non-centering transformations (as multipliers in the variable declarations in the parameters block), it is possible to change the way these are handled internally.
My idea was to make it such that the effect the multiplier has on the correlations is respected by the sampler, possibly as a per-parameter opt-in setting declared in the parameters block if there are concerns about applying it globally (e.g. because of backwards compatibility).
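For reference, here is roughly what the model from above looks like when written with that dedicated syntax (the <multiplier=...> declaration); this is only meant to illustrate the mechanism the idea refers to, nothing new:

data {
  int nGroups;
  int nWithinGroup;
  int groupId[nGroups*nWithinGroup];
  int y[nGroups*nWithinGroup];
}

parameters {
  real globalMean;
  real <lower=0> groupSd;
  // the multiplier tells Stan to sample the standardized variable
  // groupOffsets/groupSd internally and rescale it for us, which
  // reproduces the hand-coded non-centering from the model above
  vector <multiplier=groupSd> [nGroups] groupOffsets;
}

model {
  // the model block can now be written on the natural (centered) scale
  groupOffsets~normal(0,groupSd);
  
  // likelihood; poisson_log(x) is equivalent to poisson(exp(x))
  for(currDatapointId in 1:(nGroups*nWithinGroup))
  {
    y[currDatapointId]~poisson_log(
      globalMean+groupOffsets[groupId[currDatapointId]]);
  }
}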

I’m not deep enough into the math to be sure how simple an implementation would be, or whether it is possible at all, but I suspect it could be very simple. I had the thought that maybe simply multiplying/dividing the row/column of the mass matrix (except for the diagonal element) that corresponds to the non-centered variable by its multiplier might already be enough. But I could be on the wrong track with this.
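To make that guess a bit more concrete: for a fixed value of groupSd, the non-centering groupOffsets = groupSd * groupOffsets_offCenter rescales the relevant posterior covariances as

\textrm{Cov}(\textrm{globalMean}, \textrm{groupOffsets\_offCenter}[g]) = \textrm{Cov}(\textrm{globalMean}, \textrm{groupOffsets}[g]) / \textrm{groupSd}, \qquad \textrm{Var}(\textrm{groupOffsets\_offCenter}[g]) = \textrm{Var}(\textrm{groupOffsets}[g]) / \textrm{groupSd}^2,

so if a dense inverse metric is meant to track the posterior covariance, translating it between the two parameterisations would amount to dividing the corresponding row and column by the multiplier (with the diagonal element picking up the square). Again, this is only a back-of-the-envelope sketch of the idea, not something I have verified.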

So my questions are:
Is it possible?
Do you think it is valuable?

edit: I should note that I’m making a few assumptions about how Stan works in this post, and I’m not completely 100 % sure about their correctness, just 99 %. If any of those are wrong I apologise and would welcome corrections.

(Some background info on my motivation to write this post:
I often encounter this issue in the form of correlations between group offsets / Gaussian fields / etc. and observation-level errors. Recently I’ve been struggling with it in the form of correlations between the two parallel Markov chains in a structural time series model (one chain representing random fluctuations, the other a systematic trend). The latter example causes me a lot of headache at the moment, because I need the non-centered parameterisations, but couldn’t find a reparameterisation that removes all the offending correlations.)


Here, as an extra post to avoid confusion:

Another example, slightly more “beautiful”, but a more complex model, so I didn’t want to show it in the main post. It’s almost the same model as above, but I added some overdispersion to the data and an observation-level error to the model to account for it. The globalMean parameter has been set to a fixed value in this example because it would otherwise obscure the pattern I wanted to show.

The plot shows the correlation between one of the group offsets (as above, both in the centered and non-centered form) and one of the observation-level errors. The panel in the middle is what the sampler would have to deal with were it a centered parameterisation (pretty much just a simple multivariate Gaussian); the panel in the middle left shows what the sampler actually has to deal with due to the non-centering.

As with the model from the opening post, this one samples badly and throws divergences. And again, as in the opening post, the model can be repaired by reparameterizing the observation-level errors to remove the correlation with the group offsets. This is done by introducing a parameter observationOffsets_offCenter and defining
observationOffsets = observationOffsets_offCenter - correspondingGroupOffsets, which is basically the same as what I did to repair the model in the opening post (but didn’t explain there).
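The full second model isn’t shown here, but a minimal sketch of what that repair might look like, assuming a normal observation-level error on the log scale with estimated scale observationSd and globalMean fixed as described above (everything beyond the names from the opening post is made up for illustration):

data {
  int nGroups;
  int nWithinGroup;
  int groupId[nGroups*nWithinGroup];
  int y[nGroups*nWithinGroup];
}

transformed data {
  real globalMean=10; // fixed, as described above
}

parameters {
  real <lower=0> groupSd;
  real <lower=0> observationSd;
  vector [nGroups] groupOffsets_offCenter;
  vector [nGroups*nWithinGroup] observationOffsets_offCenter;
}

transformed parameters {
  vector [nGroups] groupOffsets;
  vector [nGroups*nWithinGroup] observationOffsets;
  vector [nGroups*nWithinGroup] linearPredictor;

  groupOffsets=groupOffsets_offCenter*groupSd;

  for(currDatapointId in 1:(nGroups*nWithinGroup))
  {
    // the repair: shift by the corresponding group offset so that the
    // observation-level error is no longer correlated with it
    observationOffsets[currDatapointId]=
      observationOffsets_offCenter[currDatapointId]-
      groupOffsets[groupId[currDatapointId]];
    // note that the group offset cancels here, so the data informs
    // observationOffsets_offCenter directly
    linearPredictor[currDatapointId]=
      globalMean+
      groupOffsets[groupId[currDatapointId]]+
      observationOffsets[currDatapointId];
  }
}

model {
  groupOffsets_offCenter~normal(0,1);
  // equivalent to observationOffsets~normal(0,observationSd)
  observationOffsets_offCenter~normal(groupOffsets[groupId],observationSd);
  y~poisson_log(linearPredictor);
}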


I want to add that I didn’t quite hit the tone I intended when I wrote the above posts. Unfortunately it’s no longer editable, so I can’t change it.
I’m aware that most of the devs here know way more about Stan’s inner workings than I do, and that because of this it is probable that I’m wrong about some part of my post, because otherwise what I suggested would likely already have been implemented.

Edit:
Hmm, when I redid the testing of the model in the first post today (because I wasn’t sure whether I had tested it enough with a dense metric, as that code line was deactivated in the code I posted) I got different results. The “repaired” model now doesn’t fare any better than the original one. The only thing I changed was to increase warmup. Sooo, I’m no longer sure if my original idea makes sense, as my predictions from that idea no longer conform to my test results. I didn’t retest the model from the second post.


Hi,
I actually encountered a few similar problems. I don’t understand your proposal very well, but I think it is better solved by modifying the models instead of adding some additional syntax to Stan. The two things that usually helped me are:

  • Using a centered parametrisation for (parts of) the model, especially when there are a lot of data points observed for each group (the reasons why this helps are explained at Hierarchical Modeling)
  • Enforcing a sum-to-zero constraint on the random intercepts (either soft or hard, as discussed at Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain); a sketch of the soft version is shown after this list. This changes the interpretation of the model and I have yet to completely understand exactly how, but it seemed quite sensible for some of the use cases I worked with. Also, the R-INLA package actually does this by default and the authors seem to know what they are doing :-)
  • Both of the above.
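A minimal sketch of both suggestions combined, applied to the model from the opening post (centered group offsets plus a soft sum-to-zero constraint; the scale of the constraint is just a placeholder, choosing it is discussed in the linked thread):

data {
  int nGroups;
  int nWithinGroup;
  int groupId[nGroups*nWithinGroup];
  int y[nGroups*nWithinGroup];
}

parameters {
  real globalMean;
  real <lower=0> groupSd;
  vector [nGroups] groupOffsets; // centered parameterisation
}

model {
  groupOffsets~normal(0,groupSd);

  // soft sum-to-zero constraint: pulls the mean of the offsets towards 0
  // so that it no longer competes with globalMean
  sum(groupOffsets)~normal(0,0.001*nGroups);

  for(currDatapointId in 1:(nGroups*nWithinGroup))
  {
    y[currDatapointId]~poisson_log(
      globalMean+groupOffsets[groupId[currDatapointId]]);
  }
}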

Would that fit your intuition about the problem?


Whoa, new to me! Time to delve…

Let me clarify a few things again in case they aren’t clear, just to be sure:

  1. The models in the first and second posts were only created to isolate and illustrate the problem and are not models I’m trying to “get solved”.
  2. Since I found out that the test with these models apparently doesn’t even isolate/illustrate the problem correctly, they are kind of pointless right now. I’d have to redo the test and find better example models that actually show the effect (unfortunately I’m not sure if I have the time to search for a better test at the moment; coming up with this stuff already took quite long and I hadn’t anticipated running into such big problems).
  3. Because of 2, the only thing that remains to be discussed here at the moment is the idea I had that prompted me to start the thread. Since the testing failed, I’m at the moment not sure how important and/or correct this idea is. So I guess I’ll try to explain this idea better, and skip the testing for the moment. I suspect something like this has already been discussed and probably also tested at some point.

At the end of the first post I mentioned the real-world reason why I started thinking about the things which ultimately led to me opening this thread. It’s a structural time series model. The distribution of the transitions between the (continuous) latent states is assumed to be ~normal(0,x), where x is a parameter to be estimated and can get arbitrarily close to zero. While trying to get this model working I got the impression that it is not possible to solve the resulting funnel problem without using a non-centered parameterisation. But because this introduces a complex interrelation between x, the latent states and other model variables (the correlations between the “unconstrained” representation of the states and the other variables change depending on x), it seemed that using such a non-centered parameterisation introduces too many other problems. I could not find a reparameterisation that avoids these, and there is a theoretical reason why no simple reparameterisation (of the types I considered) can exist that is able to solve this problem. While thinking about this dilemma I had the idea that if the non-centered parameterisation just did not introduce a dependency of the parameter correlations on x, then these issues wouldn’t come up. I was wondering if it is possible to do this, which is why I started this thread.
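To make the structure concrete, the problematic piece looks roughly like this (a stripped-down sketch of a single latent chain, not my actual model; the variable names and the fixed observation noise are made up for illustration):

data {
  int nTimesteps;
  vector [nTimesteps] y;
  real <lower=0> observationSd; // fixed here only to keep the sketch small
}

parameters {
  real <lower=0> x; // transition sd, can get arbitrarily close to zero
  vector [nTimesteps] transitions_offCenter;
}

transformed parameters {
  // non-centered random walk: the latent states are built from
  // standardized innovations scaled by x
  vector [nTimesteps] states=cumulative_sum(x*transitions_offCenter);
}

model {
  // this avoids the funnel between x and the transitions ...
  transitions_offCenter~normal(0,1);
  // ... but the correlations between transitions_offCenter, the states
  // and everything downstream now change depending on x
  y~normal(states,observationSd);
}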
This led to the idea of using x to change the mass matrix in such a way that it either counteracts the x-dependency of the parameter correlations resulting from the non-centering, or implements the non-centering (or something else that serves the same purpose) in a way that doesn’t create that x-dependency in the first place.
For example, if instead of non-centering the state transitions via x, x would change the mass matrix entries that correspond to the state transitions, this might solve the funnel problem without messing with the parameter correlations. Or at least it might help with the divergences resulting from the funnel, but not with the problem that x would be strongly correlated with energy (as in an ordinary centered parameterisation), which hinders exploration.
I’m not sure if this explains my thoughts well enough. If there’s still confusion I need to come up with a less messy way to explain it.

I opened this thread mainly to talk about the above mentioned idea about alternatives to the current way of dealing with funnel geometries. Nonetheless, your tips might prove valuable to get my structural time series model working properly, thanks!
I tried all kinds of combinations of centered and non-centered, but will review the theory you linked and check if it motivates a configuration I haven’t tested yet.
The thing about sum-to-zero constraints is an interesting point. I’ve been thinking about these things a lot with this model, but from a different perspective. You gave me some ideas to investigate.


Stan’s dynamic Hamiltonian Monte Carlo sampler can only work with the target density coded in a Stan program. Once that Stan program has been specified there isn’t any information the sampler itself can take advantage of to modify the target density, at least not until the sampler has actually been run and been able to explore the target distribution a bit, which is where adaptation kicks in.

Reparameterizing a Stan program changes the target density, and hence how Stan’s dynamic Hamiltonian Monte Carlo sampler interacts with that target density. Well-chosen reparameterizations can yield better interactions which then yields increased computational performance, while poorly-chosen reparameterizations can degrade performance. The key here, however, is that the parameterization is part of the information that a Stan program gives to the sampler. In other words it’s an opportunity for the programmer to give the sampler better information that can yield better performance.

You’re correct that non-centering all of the hierarchical parameters, what I call a “monolithically non-centered parameterization”, does not always yield the best-behaved posterior density function. Depending on how informative the individual likelihood functions are, all, or even just some, of the hierarchical parameters may be better off in a centered parameterization. I discuss the possible good and bad behaviors in Section 3 of Hierarchical Modeling, and provide some programming strategies for handling mixed parameterizations in Section 4.2.3 of https://betanalpha.github.io/assets/case_studies/hierarchical_modeling.html#423_Unbalanced_Likelihood_Functions.

As long as my hierarchical modeling case study is, I don’t even consider all of the possible parameterizations. When using a normal population model one can center and non-center both the population scale and the population location; while these are often centered and non-centered together, they can be implemented separately. This is often done implicitly in factor models; see, for example, the discussion in Section 1.4 of https://betanalpha.github.io/assets/case_studies/factor_modeling.html#14_Residual_Hierarchical_Models. Finally there are also partially centered parameterizations that interpolate between a centered and non-centered parameterization. These are typically not needed for normal population models, but they can be useful for other location-scale population models; see for example the absolute mess that is trying to fit the horseshoe model in Section 2.2.2.3 of Sparsity Blues.
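The basic programming trick for mixing the two parameterizations is to pass in index sets that say which contexts are centered and which are non-centered; here is a minimal sketch along those lines, reusing the variable names from the opening post (the two index arrays are assumed to partition 1:nGroups):

data {
  int nGroups;
  int nWithinGroup;
  int groupId[nGroups*nWithinGroup];
  int y[nGroups*nWithinGroup];
  int<lower=0> nCentered;           // nCentered + nNonCentered == nGroups
  int<lower=0> nNonCentered;
  int centeredIdx[nCentered];       // which groups are centered
  int nonCenteredIdx[nNonCentered]; // which groups are non-centered
}

parameters {
  real globalMean;
  real <lower=0> groupSd;
  vector [nCentered] groupOffsets_centered;     // sampled on the natural scale
  vector [nNonCentered] groupOffsets_offCenter; // sampled on the standardized scale
}

transformed parameters {
  vector [nGroups] groupOffsets;
  groupOffsets[centeredIdx]=groupOffsets_centered;
  groupOffsets[nonCenteredIdx]=groupOffsets_offCenter*groupSd;
}

model {
  groupOffsets_centered~normal(0,groupSd);
  groupOffsets_offCenter~normal(0,1);

  for(currDatapointId in 1:(nGroups*nWithinGroup))
  {
    y[currDatapointId]~poisson_log(
      globalMean+groupOffsets[groupId[currDatapointId]]);
  }
}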

Neat trick! If one is uncertain as to which contexts should be modelled as c/nc (centered/non-centered), presumably it makes no sense to implement a mixture model, eh? The degenerate topology would still be in the expanded parameter space, and since it’s not identifiable from its partner there’d be no means by which the mixture proportion parameter could bias to the non-degenerate component, right?

A mixture model like this becomes really big really quickly – K \choose 2 components for K contexts with unknown parameterizations, each requiring their own replication of the rest of the model. That aside one pathological component in a mixture model can indeed compromise the accurate fitting of the others, for example by messing up the adaptation.


I feel that we’ve lost the intended topic of this thread, which could be formulated as:

Dependencies between more than 2 parameters introduced by non-centering, and whether the info encoded in the parameter offsets could be used to adapt the sampler to this in a way that:

  1. Stan can’t do automatically through its adaptation routine and
  2. seems not simple/possible to code as a model reparametrisation.

It has become more or less clear to me that this idea only makes sense when running Stan with a dense metric. Which doesn’t seem too common? At least it’s not the default. So I think I kinda answered my own question. The answer being: it’s probably not worth the effort, too niche.

Towards the “other topic”:
People seem to be responding to my comments about having some difficulties with a particular model. (Note again: this is not the example model from the opening post. It’s a structural time series model.)

I’m familiar with this article and have checked/tested my model for these things.

My data is quite balanced so a mixed parameterisation doesn’t seem promising to me.

Yes, I’ve been thinking about these things and tested many different variants.

Mine is normal, so I don’t think it’s applicable.

I’m not seeing any bad geometries in pair plots. It must be some more complicated parameter dependency. I think my next step will be to try running the model in the testing environment of another project of mine, where I’m in the process of developing some new model diagnostics. That requires some effort and may not even work at all. But if it works, maybe it’ll help me find out what’s going wrong.


It’s important to keep in mind that a dense metric can account for only linear correlations between the parameters, and linear correlations rarely obstruct exploration enough to cause divergences. Moreover, moving between centered and non-centered parameterizations usually results in nonlinear correlations in the posterior distribution.
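Roughly speaking, a dense metric amounts to a single global linear change of variables: writing the adapted inverse metric as

M^{-1} \approx \widehat{\textrm{Cov}}(q), \qquad M^{-1} = L L^{\top}, \qquad q = L \tilde{q},

the sampler effectively explores in the coordinates \tilde{q}, in which that estimated covariance becomes the identity. A single fixed L can straighten a linear correlation, but not a correlation whose direction changes from one region of the posterior to another.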

The behavior that you’re seeing is due to partial pooling, as discussed in Section 3.2.2 of the case study, Hierarchical Modeling. Although the hierarchical coupling between the \tilde{\theta}_{k} and \tau in the non-centered parameterization results in a strong non-linear inverted funnel geometry, the marginal information learned about \tau can cut off the top and bottom of the funnel and the worst of that geometry.

With enough partial pooling the resulting posterior degeneracy won’t be exactly linear, but it may be approximately so, in which case a dense metric might lead to slightly more efficient sampling performance. But again it’s very unlikely that the dense metric sampler would admit a Markov chain Monte Carlo central limit theorem while the diagonal metric sampler would not, resulting in divergences and other problems that show up only when running the diagonal metric sampler.

While the growing awareness of the standard hierarchical degeneracies has been good overall, it can also lead people astray when the observed degenerate behavior is caused by something else; divergences in a hierarchical model aren’t always caused by the hierarchy! If none of the hierarchical pair plots (theta/theta_tilde vs tau, mu vs tau) exhibit any particular funnel-like behavior then it’s likely that the degenerate behavior is being caused by another part of the model, especially if the observational model is complex.
