Centered or non-centered parametrization for random effects

Hi,

It seems to be a pretty well established fact that the non-centered parametrization of y ~ N(mu, sigma) as y = mu + sigma * N(0, 1) works much better in Stan than its centered counterpart. What about random effects: theta ~ N(mu_theta, sigma_theta)? I found that when using the non-centered parametrization theta = mu_theta + sigma_theta * N(0, 1), multiple modes start to appear. Can anybody share their experiences? Perhaps there is some rationale for not using the non-centered parametrization for random effects.
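For concreteness, here is a minimal sketch of the non-centered version in Stan (the data block and names are just placeholders, and the likelihood is omitted):

data {
  int<lower=1> N;              // number of random effects
}
parameters {
  real mu_theta;
  real<lower=0> sigma_theta;
  vector[N] theta_raw;         // unit-normal "raw" effects
}
transformed parameters {
  // implies theta ~ normal(mu_theta, sigma_theta)
  vector[N] theta = mu_theta + sigma_theta * theta_raw;
}
model {
  theta_raw ~ normal(0, 1);
  // centered version: declare theta as a parameter and write
  // theta ~ normal(mu_theta, sigma_theta);
}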

Thanks for any insights.

Linas

No, that’s not the case when y is data; the built-in y ~ normal(mu, sigma) will be better (and we don’t support the N(...) notation).

The issue only arises when theta is a parameter. And it’s not always better; it’s better when there’s not much data.

Michael Betancourt and Mark Girolami not only share their experiences, but give a detailed analysis here:

https://arxiv.org/abs/1312.0906

There’s a book version, but it’s not open access as far as I can tell.

Bob, many thanks. Is it possible then that the appearance of multiple modes
may be caused by correlation? Some parameters are correlated (Vmax_jej
and Km_acat), and some are bounded because of physical constraints (Vmax_kid
and Km_kid).

Below is a pairs plot for the random effects where multiple modes
show up. Each chain is shown in a different color. The model is a one-way
normal model somewhat similar to the one in the Betancourt and Girolami paper:

y_it ~ N(f(theta_i, t), sigma)
theta_ik ~ N(theta_k, tau_k), k = 1, ..., n

which was rewritten as

theta_ik = theta_k + tau_k * nu_ik, with nu_ik ~ N(0, 1)

where i is the individual, n is the size of the parameter vector theta, and f is a
system of ODEs describing the PBPK model. In the pairs plot, u.k.i corresponds
to nu_ik ~ N(0, 1) and sigma_ind.k is tau_k.
Or maybe rewriting the model caused the multiple modes?

Those L shapes aren’t necessarily from multiple modes; they may just be due to multiplicative non-identifiabilities (e.g., consider trying to fit y ~ normal(mu1 * mu2, sigma) and what mu1 and mu2 will look like in the posterior). Stan and other systems struggle with these elbow-like shapes because of the varying curvature (think of the covariance at different points).
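If it helps, here’s a self-contained toy version of that non-identifiability (simulated y; nothing here comes from your model):

data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu1;
  real mu2;
  real<lower=0> sigma;
}
model {
  // only the product mu1 * mu2 is identified, so the posterior over
  // (mu1, mu2) is a curved ridge that shows up as elbows in pairs plots
  y ~ normal(mu1 * mu2, sigma);
}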

The real problems you’re going to have with sampling are those very correlated Vmax and Km parameters (one pair is very highly correlated and another pair appears to have a hard boundary constraint).

Bob,

Many thanks. Yes, those correlations give nasty traces and low n_eff while
r-hat is good.

Back to the original question. Which model is “better” for Stan to handle?
Probably the second, but doesn’t it introduce the multiplicative
non-identifiabilities I experience? In fact, I tried both: the first one
seems to give a bunch of divergences while the second one gives L shapes.

theta_i ~ normal(theta, tau)

or

theta_i = theta + tau * ksi_i, with ksi_i ~ normal(0, 1)

Linas

Depends on how strong the data is. See the paper by Betancourt and Girolami for details:

https://arxiv.org/abs/1312.0906

The problem with the non-centered parametrization is that it gives chains that
are far from reasonable (in fact, one chain out of 4). So I came up with the
approach below, but I am not sure whether it is cheating:

  • generate parameter estimates with the centered parametrization, with
    informative priors for parameters documented in the literature and vague
    priors for the remaining parameters
  • set upper and lower bounds on those parameters based on the range of the
    corresponding posterior from the first run
  • generate parameter estimates with the non-centered parametrization

The centered parametrization has lots of divergences but provides some
bounds on the plausible ranges. The non-centered parametrization has no
divergences.

It works, but I am not sure whether such an approach is legitimate.

Are you sure that the non-centered parameterization isn’t the right one? Or are you saying that the different chains don’t agree with non-centering? The only way you’ll be able to evaluate this is with simulated data.

We recommend against hard upper and lower bounds, no matter where they come from. Instead, we recommend unconstrained parameters unless physical constraints (like being a probability parameter to a binomial) impose them. But we do recommend coupling these with reasonable priors that provide softer versions of hard interval bounds.
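For example, with a hypothetical positive parameter km, a soft version of a hard interval might look like this:

parameters {
  // hard version, not recommended: real<lower=0.1, upper=10> km;
  real<lower=0> km;        // keep only the genuine physical constraint
}
model {
  km ~ lognormal(0, 1);    // soft prior; tune it to the plausible range
}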

Usually the non-centered parametrization (at least in the papers I saw) is for
y ~ normal(mu, sigma). It becomes y = mu + u * sigma, where u ~ normal(0, 1).
In my case, y_ind ~ normal(f(theta_ind), sigma) where
theta_ind ~ normal(theta, sigma_ind). After non-centering theta_ind, it
becomes theta_ind = theta + u * sigma_ind, where u ~ normal(0, 1). There is
some subtle difference introduced by the nonlinear function f. The
observation is that the centered theta_ind gives a bunch of divergences but
all chains mix well. However, for the non-centered version of theta_ind there
are chains which are significantly different from the others. In my case, it
is one chain out of 4. While 3 chains mix well and don’t give any
divergences, this one chain looks like a pig tail, gives a bunch of
divergences, and lies outside the ranges defined by the well-mixing chains.
It gives very low values for the Vmax and Km parameters, which is not really
physical. The purpose of the hard constraint was to remove these low values,
since the prior should be vague (nobody really has an idea what values
Vmax/Km should be, and for a vague prior there is a non-zero probability of
generating a pig-tail-looking chain which is outside the physically sensible
range). I wonder whether, because of this nonlinear function f, some other
parametrization would be appropriate?

You want this:

u ~ normal(0, 1);
y_ind ~ normal(f(theta + sigma_ind * u), sigma);

Well, I really need estimates of theta and the random effects. theta is the
population parameter and is a very important concept in the PK world. The
random effect is needed for the posterior predictive. Basically, we take a
bunch of subjects and estimate theta and the random effects. This becomes
the prior for a new subject. We measure plasma concentration for this new
subject and, using Bayesian inference, predict the PK profile as well as the
PD stuff. However, I may retrieve theta and the random effect via
theta_ind ~ normal(theta, sigma_random_effect). Thanks for the guidance.

You can get the individual estimates regardless of whether you use cp or ncp… just define them in the generated quantities block. Not sure if this is your issue; I was just guessing.
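Something like this, guessing at your names and sizes (K parameters, N_subj individuals, ksi and tau as in your model B):

generated quantities {
  vector[K] theta_ind[N_subj];
  for (i in 1:N_subj)
    theta_ind[i] = theta + tau .* ksi[i];   // individual-level parameters
}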

Sebastian

Thanks a lot for the good suggestions. But now I am quite confused. A
population PK model may be simplistically expressed as a two-level
hierarchical model:
y_it ~ normal(f(theta_i, t), sigma)
theta_i ~ normal(theta, tau)
where y_it - plasma concentration for individual i at time t
f - system of ODEs
theta - population parameter

At least the system of ODEs I have (a PBPK model) gives a very good fit,
as verified by the posterior predictive, but gives a bunch of divergences.
theta is in reality a vector for which some components have literature
values while some don’t. Thus I used quite vague priors for those which
don’t and informative priors for those which do.

Then I reparametrized the model:
y_it ~ normal(f(theta + ksi_i * tau, t), sigma)
ksi_i ~ normal(0, 1)

and ran 4 chains. 3 chains showed very good mixing, but one chain looked
like a pig tail and was visibly outside the region defined by the well-mixed
chains. The well-mixed chains didn’t have any divergences, but the pig-tail
chain had a lot of divergences. Also, the well-mixed chains didn’t have any
normal_log: Location parameter[238] is 1.#QNAN, but must be finite!
errors after warmup, but the pig-tail chain had them up to the end of the
simulation. Those, I guess, were caused by the function f returning NaNs for
nonsensical parameter values.

Then I did one more experiment: I hard-constrained the parameters based on
the ranges determined from the centered parametrization (the first model).
As a result there were no divergences, perfect mixing, etc. However, as I
understand it, hard constraints are not recommended. But I don’t know any
other way to remove the pig tails (except ignoring them).

One more possible cause of the divergences might be the very strong
correlation between parameters, but it seems this correlation doesn’t cause
divergences for the non-centered parametrization (the second model).

The second model is attached. I appreciate any feedback. It runs for 5 days
(even when the Jacobian is supplied to CmdStan, thanks to Sebastian). By the
way, the pig-tail chain occurred for the ncp model when the Jacobian was not
supplied.

Linas

pbpkauto.stan (11.5 KB)

You could try scaling all your parameters to roughly unit scale. This kind of thing can make adaptation tricky:

   k_leave_1 ~ normal(0.457, 0.001); // derived from Fadda et al.

Overall, it’s impossible to see the structure of that huge program.

Yes, this one was tricky. Fadda et al. did an experiment to determine
gastric times for 10 subjects. k_leave_1 is log(gastric time). A QQ plot
indicated it to be normal. So we took the mean of log(gastric time) over
all subjects as the location for k_leave_1 and the sd of log(gastric time)
as the location for the random effect, and used very small sds. What is the
better practice?

But back to the previous question. Is the ncp (model B) enough, or is there
still some re-parametrization needed?
Model A:
y_it ~ normal(f(theta_i, t), sigma)
theta_i ~ normal(theta, tau)

Model B:
y_it ~ normal(f(theta + ksi_i * tau, t), sigma)
ksi_i ~ normal(0, 1)

where y_it - plasma concentration for individual i at time t
f - system of ODEs
theta - population parameter
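In Stan-like pseudo-structure this is what I mean (dimensions and names are schematic; the likelihood stays a comment because f is the ODE solve):

data {
  int<lower=1> N;                 // individuals
  int<lower=1> K;                 // size of theta
}
parameters {
  vector[K] theta;                // population parameters
  vector<lower=0>[K] tau;
  real<lower=0> sigma;
  vector[K] ksi[N];               // model B: unit-normal individual effects
}
transformed parameters {
  vector[K] theta_i[N];
  for (i in 1:N)
    theta_i[i] = theta + tau .* ksi[i];     // model B
}
model {
  for (i in 1:N) {
    ksi[i] ~ normal(0, 1);                      // model B
    // model A instead: theta_i[i] ~ normal(theta, tau);
    // y[i] ~ normal(f(theta_i[i], t), sigma);  // f = ODE solve (schematic)
  }
}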

The raw parameter can be something like

k_leave_1_raw ~ normal(457, 1);

Then in the model use k_leave_1_raw * 0.001 wherever you would have had k_leave_1; that keeps the implied prior at normal(0.457, 0.001) while the sampler works on a unit-scale parameter.
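Put together, the sketch looks like this (only k_leave_1 shown):

parameters {
  real k_leave_1_raw;                        // unit-scale parameter the sampler sees
}
transformed parameters {
  real k_leave_1 = 0.001 * k_leave_1_raw;    // back on the natural scale
}
model {
  k_leave_1_raw ~ normal(457, 1);            // implies k_leave_1 ~ normal(0.457, 0.001)
}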